Skip to content

Track cloud-controller-manager entrypoint drift from upstream #10273

@nilo19

Description

@nilo19

/kind cleanup

What would you like to be added:

Track and close the known gaps between Azure's cloud-controller-manager entrypoint and upstream k8s.io/cloud-provider/app/controllermanager.go, while preserving Azure's dynamic reloading behavior.

Azure currently pins k8s.io/cloud-provider v0.35.0, but cmd/cloud-controller-manager/app/controllermanager.go carries a local fork of upstream command construction, serving, leader election, controller startup, and controller context setup. The goal is to migrate toward upstream-owned code where possible and make the remaining Azure-owned boundary explicit.

Required Azure behavior that must be preserved:

  • Dynamic cloud config reloading must remain supported.
  • A reload must restart the active controller generation without releasing the current leader lease.
  • File-backed and secret-backed cloud config loading must continue to work.
  • Azure-specific node IPAM, node filtering, tracing, Azure SDK metrics, and Azure cloud initialization must continue to work.

Observed gaps:

  1. Command construction is locally forked instead of using upstream builder.

    • Upstream has app.CommandBuilder with provider hooks for options, cloud initializer, controller registration, webhooks, additional flags, provider defaults, command name, and stop channel.
    • Azure directly constructs the Cobra command in cmd/cloud-controller-manager/app/controllermanager.go.
    • Migration gap: Azure cannot call upstream BuildCommand() directly until dynamic reload and Azure-specific setup have a defined hook/boundary.
  2. Dynamic reload exists only in Azure.

    • Azure has RunWrapper, runAsync, file/secret watchers, and shouldDisableCloudProvider.
    • Upstream starts controllers once per process/leader lifecycle and has no reload supervisor.
    • Migration gap: direct use of upstream Run() would drop dynamic reloading.
  3. Reload currently depends on staying inside OnStartedLeading.

    • Azure invokes RunWrapper from the leader elector's OnStartedLeading callback.
    • On config update, Azure cancels the active controller context and starts a new generation without returning from OnStartedLeading.
    • Migration gap: the reusable upstream lifecycle does not expose a generation-level restart hook under the same leader callback.
  4. Static mode has a latent fallthrough bug.

    • In Azure RunWrapper, the non-dynamic path calls Run(ctx, c.Complete(), h) but does not return afterward.
    • That means static mode can fall through into dynamic watcher setup.
    • Migration gap: this should be fixed while extracting the reload supervisor.
  5. Leader migration runtime behavior is missing in Azure.

    • The generic options expose --enable-leader-migration and --leader-migration-config.
    • Upstream checks leadermigration.Enabled, creates a LeaderMigrator, starts the main lock for non-migrated controllers, starts the migration lock for migrated controllers, and filters controller initializers.
    • Azure never checks leadermigration.Enabled, never creates the migration lock, and never filters migrated/non-migrated controllers.
    • Migration gap: Azure currently has the flag/config surface without the corresponding runtime behavior.
  6. Leader election health checker wiring differs.

    • Upstream creates one HealthzAdaptor and passes the same object both to health checks and to leaderElectAndRun.
    • Azure StartHTTPServer creates an election health checker for /healthz, then the command creates a separate checker for leaderelection.RunOrDie.
    • Migration gap: the health endpoint is not wired to the same leader-election watchdog object that observes the real leader election.
  7. Serving lifecycle is split differently.

    • Upstream starts event broadcasting, configz, webhook serving, healthz/metrics serving, leader election, and controller startup inside Run().
    • Azure starts HTTP serving before leader election and before controller generations so serving can remain process-scoped across reloads.
    • Migration gap: we need to separate process-scoped serving from generation-scoped controller startup before reusing upstream logic safely.
  8. Event broadcaster handling diverges.

    • Upstream stores EventBroadcaster in config, starts structured logging and recording in Run(), and shuts it down with defer.
    • Azure stores only EventRecorder; createRecorder creates a broadcaster, starts logging/recording, and drops the broadcaster reference.
    • Migration gap: reload can create new recorders/broadcasters without a process-scoped shutdown path, and Azure misses upstream structured event broadcaster behavior.
  9. Configz behavior diverges.

    • Upstream current master converts internal config to external v1alpha1, sets the GVK, and returns errors if configz registration or setting fails.
    • Azure registers internal ComponentConfig, logs registration errors, and does this inside each controller generation.
    • Migration gap: dynamic reload can repeatedly call configz.New(ConfigzName) and fail on already-registered configz; configz may not reflect the active reloaded generation.
  10. Webhook support is missing locally.

    • Upstream has webhook options, webhook secure serving, WebhookHandler, AllWebhooks, DisabledByDefaultWebhooks, and CloudControllerManagerWebhook feature-gated serving.
    • Azure config/options/entrypoint do not carry this upstream webhook path.
    • Migration gap: even if Azure does not need webhooks today, this is part of the upstream entrypoint surface that remains forked.
  11. Metrics wiring differs.

    • Upstream installs controller-manager SLI metrics on the base handler and registers controller-manager metrics in CreateControllerContext.
    • Azure replaces /metrics with Azure trace/OpenTelemetry metrics and /metrics/v2, but does not call controllersmetrics.Register() in its CreateControllerContext.
    • Migration gap: decide whether this is intentional Azure behavior or missing upstream metric registration.
  12. Cloud initialization happens at different lifecycle levels.

    • Upstream receives a cloudprovider.Interface from a provider-supplied cloud initializer before calling Run().
    • Azure initializes Azure cloud inside each Run() generation from either cloud config file or secret, then checks HasClusterID.
    • Migration gap: dynamic reload needs a generation-scoped cloud factory; upstream currently assumes process/run-scoped cloud construction.
  13. Controller initializer API is locally forked.

    • Upstream uses ControllerInitFuncConstructor, ConstructControllerInitializers, and DefaultInitFuncConstructors.
    • Azure uses a private initFunc signature and hardcoded newControllerInitializers().
    • Migration gap: Azure controllers need adapter wrappers before they can use upstream initializer construction.
  14. Controller set differs.

    • Upstream default controllers are cloud-node, cloud-node-lifecycle, service-lb, and node-route.
    • Azure also starts node-ipam and carries Azure-specific node IPAM options.
    • Migration gap: upstream reuse must preserve Azure's extra node IPAM controller.
  15. Node filtering is Azure-specific.

    • Azure supports --node-filter-requirements and builds filtered informer factories.
    • Upstream builds the standard shared informer factory.
    • Migration gap: upstream CreateControllerContext cannot be reused directly without preserving filtered informer construction.
  16. Options/config shape differs.

    • Upstream stores kubeconfig under Generic.ClientConnection.Kubeconfig, carries webhook options, provider defaults, NodeControllerOptions, and EventBroadcaster.
    • Azure carries a separate Kubeconfig field, dynamic reload options, node IPAM options, node filtering, and no webhook/event broadcaster fields.
    • Migration gap: adopting upstream options/config requires explicit mapping for Azure-only options and any upstream fields Azure currently lacks.
  17. Stop/shutdown model differs.

    • Upstream accepts a process stopCh, validates that the command gets no positional args, and wires stopCh through serving, leader election wait, informers, and controllers.
    • Azure uses cmd.Context() for HTTP serving but context.TODO() for leader election and RunWrapper.
    • Migration gap: moving toward upstream lifecycle should also decide how Azure wants process cancellation and graceful shutdown to work.
  18. Current upstream master has additional drift beyond the pinned module.

    • Compared with k8s.io/cloud-provider v0.35.0, upstream master moved flag printing after logging setup, tightened configz conversion/error handling, and applies node controller options through NodeController.ApplyTo.
    • Migration gap: the target should be explicit: either migrate to the pinned module behavior first, or track current upstream master behavior where compatible.

Proposed direction:

  1. Classify each difference as:

    • required Azure behavior,
    • upstream behavior Azure should adopt,
    • unsupported upstream feature that should be explicitly disabled/validated,
    • or follow-up work.
  2. Extract an Azure-owned dynamic reload supervisor boundary:

    • process-scoped setup: logging, tracing, metrics serving, healthz/readyz, configz, event broadcaster, leader election;
    • generation-scoped setup: load cloud config, create cloud, create controller context, start controllers.
  3. Preserve leader-lease continuity:

    • the supervisor must run inside OnStartedLeading;
    • reload must cancel only the current controller generation;
    • reload must not release the leader lock or re-run leader election.
  4. Handle leader migration explicitly:

    • either implement upstream dual-lock/filtering semantics with dynamic reload support,
    • or reject/disable the leader migration flags until support exists.
  5. Add tests covering:

    • static mode does not enter watcher setup;
    • dynamic reload restarts controller generation without returning from OnStartedLeading;
    • leader election healthz uses the same checker passed to the leader elector;
    • configz/event broadcaster/health checks do not leak or duplicate across reload generations;
    • leader migration flags are either functional or rejected with a clear validation error.

Why is this needed:

The local Azure entrypoint has drifted enough from upstream that routine upstream fixes are easy to miss. Some drift is required for Azure, especially dynamic reloading, but other gaps are accidental or under-specified:

  • leader migration flags are exposed but not implemented;
  • leader election health checking appears to use a different checker than the active leader elector;
  • static mode can fall through into dynamic reload setup;
  • configz and health checks are generation-scoped even though serving is process-scoped;
  • event broadcaster lifecycle differs from upstream and may not be reload-safe.

Tracking the gaps in one issue gives reviewers a concrete checklist for migrating toward upstream code without accidentally regressing Azure's dynamic reload feature.

Metadata

Metadata

Assignees

No one assigned

    Labels

    kind/cleanupCategorizes issue or PR as related to cleaning up code, process, or technical debt.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions