/kind cleanup
What would you like to be added:
Track and close the known gaps between Azure's cloud-controller-manager entrypoint and upstream k8s.io/cloud-provider/app/controllermanager.go, while preserving Azure's dynamic reloading behavior.
Azure currently pins k8s.io/cloud-provider v0.35.0, but cmd/cloud-controller-manager/app/controllermanager.go carries a local fork of upstream command construction, serving, leader election, controller startup, and controller context setup. The goal is to migrate toward upstream-owned code where possible and make the remaining Azure-owned boundary explicit.
Required Azure behavior that must be preserved:
- Dynamic cloud config reloading must remain supported.
- A reload must restart the active controller generation without releasing the current leader lease.
- File-backed and secret-backed cloud config loading must continue to work.
- Azure-specific node IPAM, node filtering, tracing, Azure SDK metrics, and Azure cloud initialization must continue to work.
Observed gaps:
-
Command construction is locally forked instead of using upstream builder.
- Upstream has
app.CommandBuilder with provider hooks for options, cloud initializer, controller registration, webhooks, additional flags, provider defaults, command name, and stop channel.
- Azure directly constructs the Cobra command in
cmd/cloud-controller-manager/app/controllermanager.go.
- Migration gap: Azure cannot call upstream
BuildCommand() directly until dynamic reload and Azure-specific setup have a defined hook/boundary.
-
Dynamic reload exists only in Azure.
- Azure has
RunWrapper, runAsync, file/secret watchers, and shouldDisableCloudProvider.
- Upstream starts controllers once per process/leader lifecycle and has no reload supervisor.
- Migration gap: direct use of upstream
Run() would drop dynamic reloading.
-
Reload currently depends on staying inside OnStartedLeading.
- Azure invokes
RunWrapper from the leader elector's OnStartedLeading callback.
- On config update, Azure cancels the active controller context and starts a new generation without returning from
OnStartedLeading.
- Migration gap: the reusable upstream lifecycle does not expose a generation-level restart hook under the same leader callback.
-
Static mode has a latent fallthrough bug.
- In Azure
RunWrapper, the non-dynamic path calls Run(ctx, c.Complete(), h) but does not return afterward.
- That means static mode can fall through into dynamic watcher setup.
- Migration gap: this should be fixed while extracting the reload supervisor.
-
Leader migration runtime behavior is missing in Azure.
- The generic options expose
--enable-leader-migration and --leader-migration-config.
- Upstream checks
leadermigration.Enabled, creates a LeaderMigrator, starts the main lock for non-migrated controllers, starts the migration lock for migrated controllers, and filters controller initializers.
- Azure never checks
leadermigration.Enabled, never creates the migration lock, and never filters migrated/non-migrated controllers.
- Migration gap: Azure currently has the flag/config surface without the corresponding runtime behavior.
-
Leader election health checker wiring differs.
- Upstream creates one
HealthzAdaptor and passes the same object both to health checks and to leaderElectAndRun.
- Azure
StartHTTPServer creates an election health checker for /healthz, then the command creates a separate checker for leaderelection.RunOrDie.
- Migration gap: the health endpoint is not wired to the same leader-election watchdog object that observes the real leader election.
-
Serving lifecycle is split differently.
- Upstream starts event broadcasting, configz, webhook serving, healthz/metrics serving, leader election, and controller startup inside
Run().
- Azure starts HTTP serving before leader election and before controller generations so serving can remain process-scoped across reloads.
- Migration gap: we need to separate process-scoped serving from generation-scoped controller startup before reusing upstream logic safely.
-
Event broadcaster handling diverges.
- Upstream stores
EventBroadcaster in config, starts structured logging and recording in Run(), and shuts it down with defer.
- Azure stores only
EventRecorder; createRecorder creates a broadcaster, starts logging/recording, and drops the broadcaster reference.
- Migration gap: reload can create new recorders/broadcasters without a process-scoped shutdown path, and Azure misses upstream structured event broadcaster behavior.
-
Configz behavior diverges.
- Upstream current master converts internal config to external
v1alpha1, sets the GVK, and returns errors if configz registration or setting fails.
- Azure registers internal
ComponentConfig, logs registration errors, and does this inside each controller generation.
- Migration gap: dynamic reload can repeatedly call
configz.New(ConfigzName) and fail on already-registered configz; configz may not reflect the active reloaded generation.
-
Webhook support is missing locally.
- Upstream has webhook options, webhook secure serving,
WebhookHandler, AllWebhooks, DisabledByDefaultWebhooks, and CloudControllerManagerWebhook feature-gated serving.
- Azure config/options/entrypoint do not carry this upstream webhook path.
- Migration gap: even if Azure does not need webhooks today, this is part of the upstream entrypoint surface that remains forked.
-
Metrics wiring differs.
- Upstream installs controller-manager SLI metrics on the base handler and registers controller-manager metrics in
CreateControllerContext.
- Azure replaces
/metrics with Azure trace/OpenTelemetry metrics and /metrics/v2, but does not call controllersmetrics.Register() in its CreateControllerContext.
- Migration gap: decide whether this is intentional Azure behavior or missing upstream metric registration.
-
Cloud initialization happens at different lifecycle levels.
- Upstream receives a
cloudprovider.Interface from a provider-supplied cloud initializer before calling Run().
- Azure initializes Azure cloud inside each
Run() generation from either cloud config file or secret, then checks HasClusterID.
- Migration gap: dynamic reload needs a generation-scoped cloud factory; upstream currently assumes process/run-scoped cloud construction.
-
Controller initializer API is locally forked.
- Upstream uses
ControllerInitFuncConstructor, ConstructControllerInitializers, and DefaultInitFuncConstructors.
- Azure uses a private
initFunc signature and hardcoded newControllerInitializers().
- Migration gap: Azure controllers need adapter wrappers before they can use upstream initializer construction.
-
Controller set differs.
- Upstream default controllers are cloud-node, cloud-node-lifecycle, service-lb, and node-route.
- Azure also starts
node-ipam and carries Azure-specific node IPAM options.
- Migration gap: upstream reuse must preserve Azure's extra node IPAM controller.
-
Node filtering is Azure-specific.
- Azure supports
--node-filter-requirements and builds filtered informer factories.
- Upstream builds the standard shared informer factory.
- Migration gap: upstream
CreateControllerContext cannot be reused directly without preserving filtered informer construction.
-
Options/config shape differs.
- Upstream stores kubeconfig under
Generic.ClientConnection.Kubeconfig, carries webhook options, provider defaults, NodeControllerOptions, and EventBroadcaster.
- Azure carries a separate
Kubeconfig field, dynamic reload options, node IPAM options, node filtering, and no webhook/event broadcaster fields.
- Migration gap: adopting upstream options/config requires explicit mapping for Azure-only options and any upstream fields Azure currently lacks.
-
Stop/shutdown model differs.
- Upstream accepts a process
stopCh, validates that the command gets no positional args, and wires stopCh through serving, leader election wait, informers, and controllers.
- Azure uses
cmd.Context() for HTTP serving but context.TODO() for leader election and RunWrapper.
- Migration gap: moving toward upstream lifecycle should also decide how Azure wants process cancellation and graceful shutdown to work.
-
Current upstream master has additional drift beyond the pinned module.
- Compared with
k8s.io/cloud-provider v0.35.0, upstream master moved flag printing after logging setup, tightened configz conversion/error handling, and applies node controller options through NodeController.ApplyTo.
- Migration gap: the target should be explicit: either migrate to the pinned module behavior first, or track current upstream master behavior where compatible.
Proposed direction:
-
Classify each difference as:
- required Azure behavior,
- upstream behavior Azure should adopt,
- unsupported upstream feature that should be explicitly disabled/validated,
- or follow-up work.
-
Extract an Azure-owned dynamic reload supervisor boundary:
- process-scoped setup: logging, tracing, metrics serving, healthz/readyz, configz, event broadcaster, leader election;
- generation-scoped setup: load cloud config, create cloud, create controller context, start controllers.
-
Preserve leader-lease continuity:
- the supervisor must run inside
OnStartedLeading;
- reload must cancel only the current controller generation;
- reload must not release the leader lock or re-run leader election.
-
Handle leader migration explicitly:
- either implement upstream dual-lock/filtering semantics with dynamic reload support,
- or reject/disable the leader migration flags until support exists.
-
Add tests covering:
- static mode does not enter watcher setup;
- dynamic reload restarts controller generation without returning from
OnStartedLeading;
- leader election healthz uses the same checker passed to the leader elector;
- configz/event broadcaster/health checks do not leak or duplicate across reload generations;
- leader migration flags are either functional or rejected with a clear validation error.
Why is this needed:
The local Azure entrypoint has drifted enough from upstream that routine upstream fixes are easy to miss. Some drift is required for Azure, especially dynamic reloading, but other gaps are accidental or under-specified:
- leader migration flags are exposed but not implemented;
- leader election health checking appears to use a different checker than the active leader elector;
- static mode can fall through into dynamic reload setup;
- configz and health checks are generation-scoped even though serving is process-scoped;
- event broadcaster lifecycle differs from upstream and may not be reload-safe.
Tracking the gaps in one issue gives reviewers a concrete checklist for migrating toward upstream code without accidentally regressing Azure's dynamic reload feature.
/kind cleanup
What would you like to be added:
Track and close the known gaps between Azure's cloud-controller-manager entrypoint and upstream
k8s.io/cloud-provider/app/controllermanager.go, while preserving Azure's dynamic reloading behavior.Azure currently pins
k8s.io/cloud-provider v0.35.0, butcmd/cloud-controller-manager/app/controllermanager.gocarries a local fork of upstream command construction, serving, leader election, controller startup, and controller context setup. The goal is to migrate toward upstream-owned code where possible and make the remaining Azure-owned boundary explicit.Required Azure behavior that must be preserved:
Observed gaps:
Command construction is locally forked instead of using upstream builder.
app.CommandBuilderwith provider hooks for options, cloud initializer, controller registration, webhooks, additional flags, provider defaults, command name, and stop channel.cmd/cloud-controller-manager/app/controllermanager.go.BuildCommand()directly until dynamic reload and Azure-specific setup have a defined hook/boundary.Dynamic reload exists only in Azure.
RunWrapper,runAsync, file/secret watchers, andshouldDisableCloudProvider.Run()would drop dynamic reloading.Reload currently depends on staying inside
OnStartedLeading.RunWrapperfrom the leader elector'sOnStartedLeadingcallback.OnStartedLeading.Static mode has a latent fallthrough bug.
RunWrapper, the non-dynamic path callsRun(ctx, c.Complete(), h)but does not return afterward.Leader migration runtime behavior is missing in Azure.
--enable-leader-migrationand--leader-migration-config.leadermigration.Enabled, creates aLeaderMigrator, starts the main lock for non-migrated controllers, starts the migration lock for migrated controllers, and filters controller initializers.leadermigration.Enabled, never creates the migration lock, and never filters migrated/non-migrated controllers.Leader election health checker wiring differs.
HealthzAdaptorand passes the same object both to health checks and toleaderElectAndRun.StartHTTPServercreates an election health checker for/healthz, then the command creates a separate checker forleaderelection.RunOrDie.Serving lifecycle is split differently.
Run().Event broadcaster handling diverges.
EventBroadcasterin config, starts structured logging and recording inRun(), and shuts it down withdefer.EventRecorder;createRecordercreates a broadcaster, starts logging/recording, and drops the broadcaster reference.Configz behavior diverges.
v1alpha1, sets the GVK, and returns errors if configz registration or setting fails.ComponentConfig, logs registration errors, and does this inside each controller generation.configz.New(ConfigzName)and fail on already-registered configz; configz may not reflect the active reloaded generation.Webhook support is missing locally.
WebhookHandler,AllWebhooks,DisabledByDefaultWebhooks, andCloudControllerManagerWebhookfeature-gated serving.Metrics wiring differs.
CreateControllerContext./metricswith Azure trace/OpenTelemetry metrics and/metrics/v2, but does not callcontrollersmetrics.Register()in itsCreateControllerContext.Cloud initialization happens at different lifecycle levels.
cloudprovider.Interfacefrom a provider-supplied cloud initializer before callingRun().Run()generation from either cloud config file or secret, then checksHasClusterID.Controller initializer API is locally forked.
ControllerInitFuncConstructor,ConstructControllerInitializers, andDefaultInitFuncConstructors.initFuncsignature and hardcodednewControllerInitializers().Controller set differs.
node-ipamand carries Azure-specific node IPAM options.Node filtering is Azure-specific.
--node-filter-requirementsand builds filtered informer factories.CreateControllerContextcannot be reused directly without preserving filtered informer construction.Options/config shape differs.
Generic.ClientConnection.Kubeconfig, carries webhook options, provider defaults,NodeControllerOptions, andEventBroadcaster.Kubeconfigfield, dynamic reload options, node IPAM options, node filtering, and no webhook/event broadcaster fields.Stop/shutdown model differs.
stopCh, validates that the command gets no positional args, and wiresstopChthrough serving, leader election wait, informers, and controllers.cmd.Context()for HTTP serving butcontext.TODO()for leader election andRunWrapper.Current upstream master has additional drift beyond the pinned module.
k8s.io/cloud-provider v0.35.0, upstream master moved flag printing after logging setup, tightened configz conversion/error handling, and applies node controller options throughNodeController.ApplyTo.Proposed direction:
Classify each difference as:
Extract an Azure-owned dynamic reload supervisor boundary:
Preserve leader-lease continuity:
OnStartedLeading;Handle leader migration explicitly:
Add tests covering:
OnStartedLeading;Why is this needed:
The local Azure entrypoint has drifted enough from upstream that routine upstream fixes are easy to miss. Some drift is required for Azure, especially dynamic reloading, but other gaps are accidental or under-specified:
Tracking the gaps in one issue gives reviewers a concrete checklist for migrating toward upstream code without accidentally regressing Azure's dynamic reload feature.