diff --git a/docs/proposals/20250425-aro-hcp.md b/docs/proposals/20250425-aro-hcp.md new file mode 100644 index 00000000000..93819682b70 --- /dev/null +++ b/docs/proposals/20250425-aro-hcp.md @@ -0,0 +1,1295 @@ +--- +title: ARO HCP Clusters +authors: + - "@serngawy" + - "@marek-veber" +reviewers: + - "@willie-yao" + - "@jackfrancis" + - "@bryan-cox" +creation-date: 2025-04-25 +last-updated: 2026-03-11 +status: implementable +--- + + +# Create ARO HCP Clusters + + +## Table of Contents + + +- [Introduction](#introduction) +- [Goals](#goals) +- [Non-Goals](#non-goals) +- [Proposal](#proposal) +- [User Stories](#user-stories) +- [Implementation Details](#implementation-details) +- [Alternatives Considered](#alternatives-considered) +- [Maintenance and Ownership](#maintenance-and-ownership) +- [Risks and Mitigations](#risks-and-mitigations) +- [Graduation Criteria](#graduation-criteria) +- [Implementation History](#implementation-history) + + +## Introduction + + +The Cluster API Provider for Azure (CAPZ) extends Kubernetes Cluster API functionality to Microsoft Azure environments, facilitating the management and lifecycle of Kubernetes clusters on Azure. This proposal outlines the integration of Azure Red Hat OpenShift (ARO) Hosted Control Plane (HCP) clusters within CAPZ using Azure Service Operator (ASO) for resource provisioning. + +ARO HCP is an evolution of Azure Red Hat OpenShift that uses a hosted control plane architecture. For more information about ARO HCP, see the [ARO-HCP repository](https://github.com/Azure/ARO-HCP). + +The implementation leverages ASO resources embedded directly in Custom Resource specifications, providing a declarative, Kubernetes-native approach to managing ARO HCP infrastructure. + + +## Goals + + +- **Integration**: Enable provisioning and management of ARO HCP clusters using CAPZ with ASO. +- **API Conformance**: Align CAPZ operations with ARO HCP API specifications via ASO resources. 
+- **Scalability**: Support scaling of ARO HCP clusters through CAPZ controllers. +- **Operational Excellence**: Provide seamless cluster lifecycle management, including creation, scaling, and deletion. +- **Declarative Infrastructure**: Use embedded ASO resources for infrastructure-as-code approach. + + +## Non-Goals + + +- Support for non-HCP Azure services not specified in the ARO HCP API. +- Field-based provisioning mode (deprecated and removed in favor of resources mode). + + +## Proposal + + +CAPZ introduces controllers and reconcilers to manage ARO HCP clusters using ASO (Azure Service Operator) resources. Instead of field-based specifications, infrastructure is defined declaratively using embedded ASO resource definitions in the `spec.resources` field. + +This approach: +- Uses ASO's native resource types (HcpOpenShiftCluster, Vault, etc.) +- Provides full control over Azure resource configuration +- Eliminates dual-path provisioning logic +- Follows Kubernetes-native patterns for infrastructure management + + +Key capabilities include: + + +- **Cluster Provisioning**: Users define ARO HCP clusters declaratively via embedded ASO resources in the CRD. CAPZ controllers reconcile these resources using ASO. + +- **Lifecycle Management**: Controllers handle scaling, upgrades, and deletion of ARO HCP clusters by reconciling ASO resources. + +- **Dependency Management**: Proper sequencing ensures resources are ready before dependent resources are created (e.g., infrastructure ready before HCP cluster creation). + + +## User Stories + + +- As a Kubernetes operator, I want to deploy an ARO HCP cluster using Kubernetes-native ASO resources via CAPZ. +- As a platform engineer, I need declarative infrastructure management for ARO HCP clusters with full control over Azure resources. +- As a DevOps engineer, I want infrastructure-as-code for ARO HCP clusters that can be version-controlled and reviewed. 
+ + +## Implementation Details + + +### Controller Architecture: + + +CAPZ implements three Custom Resource Definitions (CRDs) and their corresponding controllers to manage ARO HCP clusters: + + +#### AROControlPlane CRD + +Represents the desired state of an ARO HCP cluster's control plane. Uses embedded ASO resources for infrastructure provisioning. + +```go +// AROControlPlane is the Schema for the AROControlPlane API. +type AROControlPlane struct { + metav1.TypeMeta `json:",inline"` + metav1.ObjectMeta `json:"metadata,omitempty"` + + Spec AROControlPlaneSpec `json:"spec,omitempty"` + Status AROControlPlaneStatus `json:"status,omitempty"` +} + +type AROControlPlaneSpec struct { + // Resources are embedded ASO resources to be managed by this AROControlPlane. + // This allows you to define the full infrastructure including HcpOpenShiftCluster and + // HcpOpenShiftClustersExternalAuth resources directly using ASO types. + // + // Required. Must include at minimum: + // - HcpOpenShiftCluster resource + // - HcpOpenShiftClustersExternalAuth resource (if external auth is needed) + // + // Example resources (using private preview API v1api20240610preview): + // - redhatopenshift.azure.com/v1api20240610preview/HcpOpenShiftCluster + // - redhatopenshift.azure.com/v1api20240610preview/HcpOpenShiftClustersExternalAuth + // + // Note: v1api20240610preview is the private preview API. Migration to public preview + // API (v1api20251223preview) is planned. See API specs at: + // https://github.com/Azure/ARO-HCP/tree/main/api/redhatopenshift/resource-manager/Microsoft.RedHatOpenShift/hcpclusters/preview + Resources []runtime.RawExtension `json:"resources,omitempty"` + + // IdentityRef is a reference to an identity to be used when reconciling the ARO control plane. + // This field is optional. When set, CAPZ will: + // - Initialize Azure SDK credentials using the specified identity + // - Create encryption keys in Key Vault (but NOT the vault itself!) 
+ // - Propagate key versions to HcpOpenShiftCluster spec + // + // IMPORTANT: Even when identityRef is set, the Vault resource must be declared in + // AROCluster.spec.resources[] so ASO can create the vault. CAPZ only creates the + // encryption KEY inside the existing vault. + // + // When NOT set (ASO credential-based mode): + // - ASO handles authentication via serviceoperator.azure.com/credential-from annotations + // - CAPZ skips Key Vault operations + // - Customers must manually create the vault (in AROCluster.spec.resources[]) and encryption key + // - The key version must be manually specified in HcpOpenShiftCluster spec + // + // +optional + IdentityRef *corev1.ObjectReference `json:"identityRef,omitempty"` + + // SubscriptionID is the GUID of the Azure subscription that owns this cluster. + // Required for Azure API authentication and ARM resource ID construction. + SubscriptionID string `json:"subscriptionID,omitempty"` + + // AzureEnvironment is the name of the AzureCloud to be used. + // The default value that would be used by most users is "AzurePublicCloud", other values are: + // - ChinaCloud: "AzureChinaCloud" + // - PublicCloud: "AzurePublicCloud" + // - USGovernmentCloud: "AzureUSGovernmentCloud" + // + // Note that values other than the default must also be accompanied by corresponding changes to the + // aso-controller-settings Secret to configure ASO to refer to the non-Public cloud. + AzureEnvironment string `json:"azureEnvironment,omitempty"` +} + +type AROControlPlaneStatus struct { + // Initialization status + Initialization *AROControlPlaneInitializationStatus `json:"initialization,omitempty"` + + // Ready indicates that the AROControlPlane API Server is ready to receive requests. + // Set to true when both HcpClusterReady condition is true AND kubeconfig exists. 
+ Ready bool `json:"ready"` + + // FailureMessage will be set in the event that there is a terminal problem + FailureMessage *string `json:"failureMessage,omitempty"` + + // Conditions specifies the conditions for the managed control plane + // Key conditions: + // - HcpClusterReady: HCP cluster is fully operational + // - EncryptionKeyReady: ETCD encryption key is ready (if encryption configured) + // - ExternalAuthReady: External authentication is configured (if enabled) + Conditions clusterv1.Conditions `json:"conditions,omitempty"` + + // Resources status for each ASO resource managed by this control plane + Resources []ResourceStatus `json:"resources,omitempty"` + + // APIURL is the url for the ARO-HCP openshift cluster api endpoint. + APIURL string `json:"apiURL,omitempty"` + + // Version is the ARO-HCP OpenShift semantic version, for example "4.20.0". + Version string `json:"version,omitempty"` +} + +type AROControlPlaneInitializationStatus struct { + // ControlPlaneInitialized indicates whether or not the control plane has the + // uploaded kubeconfig secret and the secret is valid. 
+ ControlPlaneInitialized bool `json:"controlPlaneInitialized,omitempty"` +} + +type ResourceStatus struct { + // Resource reference + Resource corev1.ObjectReference `json:"resource"` + + // Ready indicates if the resource is ready + Ready bool `json:"ready"` + + // Message provides details about the resource status + Message string `json:"message,omitempty"` +} +``` + +An example of AROControlPlane CR using resources mode: + +```yaml +apiVersion: controlplane.cluster.x-k8s.io/v1beta1 +kind: AROControlPlane +metadata: + name: my-cluster + namespace: default +spec: + identityRef: + kind: AzureClusterIdentity + name: aro-identity + namespace: default + subscriptionID: "00000000-0000-0000-0000-000000000000" + azureEnvironment: "AzurePublicCloud" # Optional, defaults to AzurePublicCloud + resources: + # HcpOpenShiftCluster - The main ARO HCP cluster resource + - apiVersion: redhatopenshift.azure.com/v1api20240610preview + kind: HcpOpenShiftCluster + metadata: + name: my-cluster + namespace: default + spec: + azureName: my-cluster + owner: + name: my-cluster-resgroup + location: eastus + identity: + type: UserAssigned + userAssignedIdentities: + - reference: + armId: /subscriptions/.../userAssignedIdentities/my-cluster-cp-control-plane + properties: + api: + visibility: Public + clusterImageRegistry: + state: Enabled + etcd: + dataEncryption: + customerManaged: + encryptionType: KMS + kms: + activeKey: + name: etcd-data-kms-encryption-key + vaultName: my-cluster-kv + keyManagementMode: CustomerManaged + network: + hostPrefix: 23 + machineCidr: "10.0.0.0/16" + networkType: OVNKubernetes + podCidr: "10.128.0.0/14" + serviceCidr: "172.30.0.0/16" + platform: + managedResourceGroup: capz_node_my-cluster_rg + networkSecurityGroupReference: + group: network.azure.com + kind: NetworkSecurityGroup + name: my-cluster-nsg + operatorsAuthentication: + userAssignedIdentities: + controlPlaneOperatorsReferences: + control-plane: + armId: /subscriptions/.../userAssignedIdentities/... 
+ outboundType: LoadBalancer + subnetReference: + group: network.azure.com + kind: VirtualNetworksSubnet + name: my-cluster-vnet-subnet + version: + channelGroup: stable + id: "4.20" + operatorSpec: + secrets: + adminCredentials: + key: value + name: my-cluster-kubeconfig + + # HcpOpenShiftClustersExternalAuth - Optional external authentication + - apiVersion: redhatopenshift.azure.com/v1api20240610preview + kind: HcpOpenShiftClustersExternalAuth + metadata: + name: my-cluster-ea + namespace: default + spec: + azureName: my-cluster-ea + owner: + name: my-cluster # References the HcpOpenShiftCluster + properties: + claim: + mappings: + groups: + claim: groups + username: + claim: oid + prefixPolicy: NoPrefix + clients: + - clientId: "51ed0256-1d54-46d4-9479-14ac212ca10f" + component: + authClientNamespace: openshift-console + name: console + extraScopes: + - openid + - profile + type: Confidential + issuer: + audiences: + - "51ed0256-1d54-46d4-9479-14ac212ca10f" + url: "https://login.microsoftonline.com//v2.0" + +status: + initialization: + controlPlaneInitialized: true + ready: true + conditions: + - type: HcpClusterReady + status: "True" + reason: Succeeded + message: HCP cluster is fully operational + - type: EncryptionKeyReady + status: "True" + reason: KeyReady + message: Encryption key 'etcd-data-kms-encryption-key' version 'abc123' ready in vault 'my-cluster-kv' + - type: ExternalAuthReady + status: "True" + reason: Succeeded + message: External authentication configured successfully + resources: + - resource: + apiVersion: redhatopenshift.azure.com/v1api20240610preview + kind: HcpOpenShiftCluster + name: my-cluster + namespace: default + ready: true + message: Cluster is provisioned and ready + apiURL: "https://api.my-cluster.example.com:6443" + version: "4.20.0" +``` + +**Example: ASO Credential-Based Mode (without identityRef)** + +This example shows AROControlPlane using ASO credentials instead of identityRef: + +```yaml +apiVersion: 
controlplane.cluster.x-k8s.io/v1beta1 +kind: AROControlPlane +metadata: + name: hcp-cluster + namespace: default +spec: + # identityRef is NOT set - using ASO credential-based authentication + # ASO uses serviceoperator.azure.com/credential-from annotations + subscriptionID: "00000000-0000-0000-0000-000000000000" + azureEnvironment: "AzurePublicCloud" + resources: + # HcpOpenShiftCluster - The main ARO HCP cluster resource + - apiVersion: redhatopenshift.azure.com/v1api20240610preview + kind: HcpOpenShiftCluster + metadata: + name: hcp-cluster + namespace: default + annotations: + # ASO credential annotation - ASO uses this for authentication + serviceoperator.azure.com/credential-from: aso-credential + spec: + azureName: hcp-cluster + owner: + name: hcp-cluster-resgroup + location: eastus + identity: + type: UserAssigned + userAssignedIdentities: + - reference: + armId: /subscriptions/.../userAssignedIdentities/hcp-cluster-cp-control-plane + properties: + api: + visibility: Public + clusterImageRegistry: + state: Enabled + etcd: + dataEncryption: + customerManaged: + encryptionType: KMS + kms: + # IMPORTANT: When identityRef is not set, activeKey.version MUST be manually specified + # CAPZ cannot auto-create or propagate the key without Azure credentials + # Customer must manually create the key and specify the version here + activeKey: + vaultName: "my-vault" + name: "etcd-data-kms-encryption-key" + version: "abc123def456" # ✅ REQUIRED when identityRef is nil + keyManagementMode: CustomerManaged + network: + hostPrefix: 23 + machineCidr: "10.0.0.0/16" + networkType: OVNKubernetes + podCidr: "10.128.0.0/14" + serviceCidr: "172.30.0.0/16" + platform: + managedResourceGroup: capz_node_hcp-cluster_rg + networkSecurityGroupReference: + group: network.azure.com + kind: NetworkSecurityGroup + name: hcp-cluster-nsg + operatorsAuthentication: + userAssignedIdentities: + controlPlaneOperatorsReferences: + control-plane: + armId: 
/subscriptions/.../userAssignedIdentities/... + outboundType: LoadBalancer + subnetReference: + group: network.azure.com + kind: VirtualNetworksSubnet + name: hcp-cluster-vnet-subnet + version: + channelGroup: stable + id: "4.20" + operatorSpec: + secrets: + adminCredentials: + key: value + name: hcp-cluster-kubeconfig + +status: + ready: true + conditions: + - type: HcpClusterReady + status: "True" + reason: Succeeded + message: HCP cluster is fully operational + - type: EncryptionKeyReady + status: Unknown + reason: ManualKeyManagement + message: "identityRef not set - encryption key must be manually created and specified in HcpOpenShiftCluster" + resources: + - resource: + apiVersion: redhatopenshift.azure.com/v1api20240610preview + kind: HcpOpenShiftCluster + name: hcp-cluster + namespace: default + ready: true + message: Cluster is provisioned and ready + apiURL: "https://api.hcp-cluster.example.com:6443" + version: "4.20.0" +``` + +**Note**: When using ASO credential-based mode: +- ✅ No need to manage AzureClusterIdentity +- ✅ ASO handles all authentication via credential annotations +- ⚠️ Customer must manually create Key Vault and encryption key +- ⚠️ Customer must manually specify `activeKey.version` in HcpOpenShiftCluster spec +- ⚠️ Webhook validation will reject the CR if activeKey.version is missing and encryption is configured + +#### AROMachinePool CRD + +Represents the desired state of the worker nodes (compute node pools) using embedded ASO HcpOpenShiftClustersNodePool resources. + +```go +// AROMachinePool is the Schema for the AROMachinePool API. +type AROMachinePool struct { + metav1.TypeMeta `json:",inline"` + metav1.ObjectMeta `json:"metadata,omitempty"` + + Spec AROMachinePoolSpec `json:"spec,omitempty"` + Status AROMachinePoolStatus `json:"status,omitempty"` +} + +// AROMachinePoolSpec defines the desired spec of AROMachinePool. +type AROMachinePoolSpec struct { + // Resources are embedded ASO resources to be managed by this AROMachinePool. 
+ // Must include HcpOpenShiftClustersNodePool resource. + // + // Example (using private preview API v1api20240610preview): + // - redhatopenshift.azure.com/v1api20240610preview/HcpOpenShiftClustersNodePool + Resources []runtime.RawExtension `json:"resources"` +} + +// AROMachinePoolStatus defines the observed state of AROMachinePool. +type AROMachinePoolStatus struct { + // Conditions specifies the conditions for the machine pool + Conditions clusterv1.Conditions `json:"conditions,omitempty"` + + // Resources status for each ASO resource managed by this machine pool + Resources []ResourceStatus `json:"resources,omitempty"` + + // Ready indicates that the AROMachinePool has joined the cluster + Ready bool `json:"ready"` + + // FailureMessage will be set in the event that there is a terminal problem + FailureMessage *string `json:"failureMessage,omitempty"` + + // Replicas are the most recently observed number of replicas + Replicas int32 `json:"replicas"` +} +``` + +An example of AROMachinePool CR: + +```yaml +apiVersion: infrastructure.cluster.x-k8s.io/v1beta1 +kind: AROMachinePool +metadata: + name: my-cluster-mp1 + namespace: default +spec: + resources: + - apiVersion: redhatopenshift.azure.com/v1api20240610preview + kind: HcpOpenShiftClustersNodePool + metadata: + name: my-cluster-mp1 + namespace: default + spec: + azureName: my-cluster-mp1 + owner: + name: my-cluster # References the HcpOpenShiftCluster + properties: + autoRepair: true + autoscaling: + max: 10 + min: 2 + labels: + node-role.kubernetes.io/worker: "" + platform: + diskSizeGiB: 120 + diskStorageAccountType: Premium_LRS + subnetReference: + group: network.azure.com + kind: VirtualNetworksSubnet + name: my-cluster-vnet-subnet + vmSize: Standard_D4s_v3 + version: + channelGroup: stable + id: "4.20" +status: + ready: true + replicas: 5 + conditions: + - type: Ready + status: "True" + reason: NodePoolReady + resources: + - resource: + apiVersion: redhatopenshift.azure.com/v1api20240610preview + kind: 
HcpOpenShiftClustersNodePool + name: my-cluster-mp1 + ready: true +``` + +#### AROCluster CRD + +Represents the infrastructure cluster, managing infrastructure resources via embedded ASO resources. + +```go +// AROCluster is the Schema for the AROClusters API. +type AROCluster struct { + metav1.TypeMeta `json:",inline"` + metav1.ObjectMeta `json:"metadata,omitempty"` + + Spec AROClusterSpec `json:"spec,omitempty"` + Status AROClusterStatus `json:"status,omitempty"` +} + +// AROClusterSpec defines the desired spec of AROCluster. +type AROClusterSpec struct { + // Resources are embedded ASO resources to be managed by this AROCluster. + // Typically includes infrastructure resources like: + // - ResourceGroup + // - VirtualNetwork + // - Subnet + // - NetworkSecurityGroup + // - Vault (for encryption) + // - UserAssignedIdentity resources + // + // These resources are provisioned before the HcpOpenShiftCluster. + Resources []runtime.RawExtension `json:"resources,omitempty"` + + // ControlPlaneEndpoint represents the endpoint used to communicate with the control plane. + // Set by the controller from AROControlPlane status. + ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint"` + + // SubscriptionID is the Azure subscription ID for the cluster + SubscriptionID string `json:"subscriptionID"` + + // IdentityRef is a reference to an identity to be used when reconciling resources. + // This field is optional. When not set, ASO handles authentication via + // serviceoperator.azure.com/credential-from annotations. + // +optional + IdentityRef *corev1.ObjectReference `json:"identityRef,omitempty"` +} + +// AROClusterStatus defines the observed state of AROCluster. +type AROClusterStatus struct { + // Initialization status + Initialization *AROClusterInitializationStatus `json:"initialization,omitempty"` + + // Ready is when the infrastructure and control plane are ready. 
+ Ready bool `json:"ready,omitempty"` + + // Conditions define the current service state of the AROCluster. + // Key conditions: + // - ResourcesReady: All infrastructure resources are ready + // - NetworkInfrastructureReady: Network infrastructure is ready + Conditions clusterv1.Conditions `json:"conditions,omitempty"` + + // Resources status for each ASO resource managed by this cluster + Resources []ResourceStatus `json:"resources,omitempty"` +} + +type AROClusterInitializationStatus struct { + // Provisioned indicates whether infrastructure is provisioned. + // Set to true when: + // 1. All infrastructure resources are ready (ResourcesReady condition) + // 2. AROControlPlane is ready (HcpClusterReady + kubeconfig exists) + // This gates CAPI from attempting to connect before the cluster is actually ready. + Provisioned bool `json:"provisioned,omitempty"` +} +``` + +An example of AROCluster CR: + +```yaml +apiVersion: infrastructure.cluster.x-k8s.io/v1beta1 +kind: AROCluster +metadata: + name: my-cluster + namespace: default +spec: + subscriptionID: "00000000-0000-0000-0000-000000000000" + controlPlaneEndpoint: + host: api.my-cluster.example.com + port: 6443 + identityRef: + kind: AzureClusterIdentity + name: aro-identity + resources: + # ResourceGroup + - apiVersion: resources.azure.com/v1api20200601 + kind: ResourceGroup + metadata: + name: my-cluster-resgroup + namespace: default + spec: + azureName: my-cluster-resgroup + location: eastus + + # VirtualNetwork + - apiVersion: network.azure.com/v1api20201101 + kind: VirtualNetwork + metadata: + name: my-cluster-vnet + namespace: default + spec: + azureName: my-cluster-vnet + owner: + name: my-cluster-resgroup + location: eastus + properties: + addressSpace: + addressPrefixes: + - "10.0.0.0/16" + + # Subnet + - apiVersion: network.azure.com/v1api20201101 + kind: VirtualNetworksSubnet + metadata: + name: my-cluster-vnet-subnet + namespace: default + spec: + azureName: my-cluster-subnet + owner: + name: 
my-cluster-vnet + properties: + addressPrefix: "10.0.0.0/24" + + # NetworkSecurityGroup + - apiVersion: network.azure.com/v1api20201101 + kind: NetworkSecurityGroup + metadata: + name: my-cluster-nsg + namespace: default + spec: + azureName: my-cluster-nsg + owner: + name: my-cluster-resgroup + location: eastus + + # KeyVault for encryption + # REQUIRED: Vault must be declared here even when identityRef is set + # ASO creates the vault, CAPZ creates the encryption key inside it + - apiVersion: keyvault.azure.com/v1api20230701 + kind: Vault + metadata: + name: my-cluster-kv + namespace: default + spec: + azureName: my-cluster-kv # Must match vaultName in HcpOpenShiftCluster + owner: + name: my-cluster-resgroup + location: eastus + properties: + sku: + family: A + name: standard + tenantId: "00000000-0000-0000-0000-000000000000" + enableRbacAuthorization: true + enableSoftDelete: true + + # User Assigned Identities (example - multiple needed for ARO) + - apiVersion: managedidentity.azure.com/v1api20230131 + kind: UserAssignedIdentity + metadata: + name: my-cluster-cp-control-plane + namespace: default + spec: + azureName: my-cluster-cp-control-plane + owner: + name: my-cluster-resgroup + location: eastus + +status: + initialization: + provisioned: true + ready: true + conditions: + - type: ResourcesReady + status: "True" + reason: InfrastructureReady + message: "All 7 infrastructure resources are ready" + - type: NetworkInfrastructureReady + status: "True" + reason: Succeeded + resources: + - resource: + kind: ResourceGroup + name: my-cluster-resgroup + ready: true + - resource: + kind: VirtualNetwork + name: my-cluster-vnet + ready: true + # ... 
other resources +``` + +### Controller Responsibilities + +Each controller is responsible for: + +#### AROControlPlane Controller + +- Watches `AROControlPlane` resources +- Reconciles embedded ASO resources (HcpOpenShiftCluster, HcpOpenShiftClustersExternalAuth) +- Manages encryption key creation via keyvaults service +- Updates resource status based on ASO resource conditions +- Sets `Ready` status when: + 1. HcpClusterReady condition is True (from ASO) + 2. Kubeconfig secret exists +- Sets `ControlPlaneInitialized` when both conditions above are met +- Applies mutators to: + - Set defaults on HcpOpenShiftCluster + - Inject encryption key version + - Set owner references + +Key features: +- **Dependency Management**: Waits for AROCluster infrastructure (ResourcesReady condition) before creating HcpOpenShiftCluster +- **ExternalAuth Deferral**: Uses resource filtering to defer ExternalAuth creation until NodePool is ready to avoid Azure API errors +- **Encryption Key Management**: Uses keyvaults service to create/retrieve encryption keys, sets version in HcpOpenShiftCluster via mutator +- **Condition Tracking**: + - `HcpClusterReady`: HCP cluster operational status + - `EncryptionKeyReady`: Encryption key status (if configured) + - `ExternalAuthReady`: External auth status (if configured) + +#### AROMachinePool Controller + +- Watches `AROMachinePool` resources +- Reconciles embedded ASO HcpOpenShiftClustersNodePool resources +- Waits for AROControlPlane to be ready before creating node pools +- Updates resource status based on ASO resource conditions +- Applies mutators to set defaults and owner references +- **Populates providerIDList from workload cluster nodes** (no Azure SDK required): + - Uses ClusterTracker to get workload cluster client + - Lists nodes from workload cluster + - Filters nodes by node pool name pattern + - Extracts providerID from node.Spec.ProviderID + - Pattern matches ASOManagedMachinePool for consistency + +#### AROCluster Controller + 
+- Watches `AROCluster` resources +- Reconciles embedded ASO infrastructure resources (ResourceGroup, VNet, Subnet, NSG, Vault, Identities) +- Tracks infrastructure readiness via `ResourcesReady` condition +- Mirrors AROControlPlane endpoint to spec.controlPlaneEndpoint +- Sets `Initialization.Provisioned` to true when both of the following hold: + 1. ResourcesReady condition is True + 2. AROControlPlane.Ready is True +- This gates CAPI from connecting until the infrastructure is ready and the kubeconfig secret exists + +### Dependency Chain + +Resources are created in the following order, with each step gated on the one before it: + +``` +1. AROCluster infrastructure resources created + ↓ +2. ResourcesReady condition becomes True + ↓ +3. HcpOpenShiftCluster created (waits for step 2) + ↓ +4. HcpClusterReady condition becomes True + ↓ +5. Kubeconfig secret created by ASO + ↓ +6. AROControlPlane.Ready becomes True (checks steps 4 & 5) + ↓ +7. AROCluster.Ready becomes True + ↓ +8. AROCluster.Initialization.Provisioned becomes True + ↓ +9. CAPI Cluster.InfrastructureProvisioned becomes True + ↓ +10. CAPI connects to cluster +``` + +This dependency chain **prevents CAPI kubeconfig errors** caused by connecting too early: the kubeconfig secret must exist before CAPI attempts to connect. 
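The gating in the dependency chain above can be sketched as a few pure functions over the observed state. This is a minimal illustration, not the CAPZ implementation: `Snapshot`, `CanCreateHcpCluster`, `ControlPlaneReady`, `Provisioned`, and `NextBlocker` are hypothetical names introduced here, and the real controllers read these inputs from conditions and secrets rather than from a struct.

```go
package main

// Snapshot is a hypothetical view of the inputs the dependency chain checks.
// It is an illustration only and does not correspond to a CAPZ type.
type Snapshot struct {
	ResourcesReady   bool // AROCluster ResourcesReady condition (step 2)
	HcpClusterReady  bool // HcpClusterReady condition surfaced from ASO (step 4)
	KubeconfigExists bool // kubeconfig secret created by ASO (step 5)
}

// CanCreateHcpCluster mirrors step 3: the HcpOpenShiftCluster is only
// created once the infrastructure resources are ready.
func CanCreateHcpCluster(s Snapshot) bool {
	return s.ResourcesReady
}

// ControlPlaneReady mirrors step 6: both the HCP cluster condition and the
// kubeconfig secret are required.
func ControlPlaneReady(s Snapshot) bool {
	return s.HcpClusterReady && s.KubeconfigExists
}

// Provisioned mirrors step 8: infrastructure ready AND control plane ready.
func Provisioned(s Snapshot) bool {
	return s.ResourcesReady && ControlPlaneReady(s)
}

// NextBlocker names the first unmet gate in chain order, or returns an
// empty string when nothing blocks CAPI from connecting.
func NextBlocker(s Snapshot) string {
	switch {
	case !s.ResourcesReady:
		return "waiting for AROCluster ResourcesReady"
	case !s.HcpClusterReady:
		return "waiting for HcpClusterReady"
	case !s.KubeconfigExists:
		return "waiting for kubeconfig secret"
	default:
		return ""
	}
}
```

Calling `NextBlocker` with `ResourcesReady` and `HcpClusterReady` set but `KubeconfigExists` unset reports the kubeconfig gate. In that state `Provisioned` stays false, which is the case the chain is designed to cover. Keeping the gates as side-effect-free functions is a design choice made for this sketch so that the ordering can be unit tested without a cluster.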
+ +### Syncing Between Controllers + +- **Ownership and References**: + - ASO resources use `owner` references to link resources (e.g., NodePool → HcpOpenShiftCluster) + - CAPI resources use `ownerReferences` for cascading deletes + - AROMachinePool references AROControlPlane via cluster name + +- **Reconciliation Ordering**: + - AROCluster resources must be ready before AROControlPlane creates HcpOpenShiftCluster + - AROControlPlane must be ready before AROMachinePool creates NodePools + - Controllers watch related resources and requeue when dependencies change + +- **Status Propagation**: + - Resource statuses are aggregated into CRD conditions + - Parent resources track child resource readiness + - CAPI Cluster reflects AROCluster initialization status + +- **Error Handling and Retries**: + - Controllers use exponential backoff for transient errors + - ASO handles Azure API rate limiting and retries + - Terminal errors are surfaced in status conditions + +### Resources Mode Architecture + +The implementation uses a **resources-only mode** where all infrastructure is defined using embedded ASO resources: + +**Benefits:** +- **Declarative**: Full infrastructure as code +- **Flexible**: Direct access to all ASO resource properties +- **Version controlled**: Resources can be managed in git +- **Type-safe**: Uses ASO's generated types +- **Consistent**: Same pattern for all Azure resources + +**Implementation:** +- ASO resources are embedded in `spec.resources` as RawExtension +- Controllers convert RawExtension to Unstructured for processing +- Mutators apply defaults and inject dynamic values (e.g., encryption key version) +- ResourceReconciler manages ASO resource lifecycle +- Status tracking reports on each resource's readiness + +### Authentication Modes + +AROControlPlane supports two authentication modes for Azure API access: + +#### 1. 
CAPZ Managed Authentication (with identityRef) + +When `identityRef` is specified: +- CAPZ initializes Azure SDK credentials from AzureClusterIdentity +- CAPZ performs Key Vault operations automatically: + - Creates encryption keys in Key Vault + - Retrieves key versions + - Propagates key versions to HcpOpenShiftCluster spec +- **Best for**: Users who want automated key management +- **Configuration**: Set `spec.identityRef` to reference an AzureClusterIdentity + +Example: +```yaml +spec: + identityRef: + kind: AzureClusterIdentity + name: aro-identity + namespace: default +``` + +#### 2. ASO Credential-Based Authentication (without identityRef) + +When `identityRef` is NOT specified: +- ASO handles authentication via `serviceoperator.azure.com/credential-from` annotations +- CAPZ skips all Key Vault operations +- Customer responsibilities: + - Manually create Key Vault via ASO + - Manually create encryption key + - **Manually specify activeKey.version in HcpOpenShiftCluster spec** +- **Best for**: Users who want full control via ASO +- **Configuration**: Omit `spec.identityRef` and use ASO credential annotations + +Example: +```yaml +spec: + # identityRef: NOT SET - using ASO credentials + subscriptionID: "00000000-0000-0000-0000-000000000000" + resources: + - apiVersion: redhatopenshift.azure.com/v1api20240610preview + kind: HcpOpenShiftCluster + spec: + properties: + etcd: + dataEncryption: + customerManaged: + kms: + activeKey: + vaultName: "my-vault" + name: "etcd-data-kms-encryption-key" + version: "abc123def456" # REQUIRED when identityRef not set +``` + +### KeyVault Service and Encryption Key Management + +While most resources are managed via ASO, the **keyvaults service** handles encryption key management when `identityRef` is set: + +**Why keep keyvaults service:** +- ASO manages the Vault **resource** (the vault itself) +- Azure SDK (via keyvaults service) manages **keys within the vault** +- ASO doesn't support key version retrieval +- Key version 
must be injected into HcpOpenShiftCluster spec + +**IMPORTANT: Vault resource location** +- ⚠️ The Vault resource must be declared in **AROCluster.spec.resources[]**, NOT AROControlPlane +- Even when `identityRef` is set, CAPZ does NOT create the vault itself +- CAPZ only creates the encryption **KEY** inside an existing vault +- ASO creates the vault based on the Vault resource in AROCluster.spec.resources[] + +**How it works (with identityRef):** +1. **Vault resource created via ASO** (declared in **AROCluster.spec.resources[]** - REQUIRED!) +2. keyvaults service waits for vault to be ready (checks AROCluster.Status.Resources) +3. keyvaults service extracts vault metadata from AROCluster.spec.resources[] +4. keyvaults service queries deployed Vault and gets resource group from status.id +5. keyvaults service creates/retrieves encryption key using Azure SDK +6. Key version stored in scope via SetVaultInfo() +7. Mutator injects key version into HcpOpenShiftCluster spec +8. HcpOpenShiftCluster references the key for ETCD encryption + +**How it works (without identityRef):** +1. Customer creates Vault via ASO (in AROCluster.spec.resources[]) +2. Customer creates encryption key manually (via Azure Portal, CLI, or ASO) +3. Customer specifies activeKey.version in HcpOpenShiftCluster spec +4. CAPZ validates activeKey.version is present (see validation section below) +5. HcpOpenShiftCluster uses the manually specified key + +**KeyVault Refactoring (2026-02-26):** + +The KeyVault service was refactored to properly handle `spec.azureName` and eliminate hardcoded API versions: + +1. **Vault metadata discovery**: `getVaultK8sInfo()` extracts K8s object name, namespace, and API version from AROCluster.spec.resources[] +2. **Resource group extraction**: `getVaultResourceGroupFromStatus()` queries the deployed Vault and parses resource group from `status.id` +3. 
**Robust ARM ID parsing**: `parseARMResourceAttribute()` uses attribute/value pair pattern instead of position-based parsing +4. **No hardcoded versions**: API versions dynamically discovered from AROCluster.spec.resources[] +5. **Production tested**: Successfully deployed in mv9-stage cluster with K8s name (`mv9-stage-kv`) ≠ Azure name (`mv9-stage-kv-actual`) + +**Error messages**: +- Clear guidance when vault not found: "vault with azureName 'X' not found in AROCluster spec.resources - when using encryption with identityRef, the Vault resource must be declared in AROCluster.spec.resources[]" +- Explains that CAPZ creates the key, but ASO creates the vault + +### Encryption Key Version Validation + +To prevent deployment failures, CAPZ implements **two-layer validation** for encryption key versions when `identityRef` is not set: + +#### Layer 1: Webhook Validation (Create/Update Time) + +**Purpose**: Fail fast with clear error messages before any reconciliation starts. + +**Implementation**: `exp/api/controlplane/v1beta2/arocontrolplane_webhook.go` + +**Validation Logic**: +1. For each resource in `spec.resources` +2. If resource is HcpOpenShiftCluster +3. If ETCD encryption is configured (`spec.properties.etcd.dataEncryption.customerManaged.kms` exists) +4. AND `identityRef` is NOT set +5. 
THEN validate that `kms.activeKey.version` is specified + +**Error Message**: +``` +activeKey.version is required when identityRef is not set - +CAPZ cannot auto-create or propagate the encryption key without Azure credentials +``` + +**Example - Invalid (will be rejected by webhook)**: +```yaml +spec: + # identityRef: NOT SET + resources: + - kind: HcpOpenShiftCluster + spec: + properties: + etcd: + dataEncryption: + customerManaged: + kms: + # activeKey.version: MISSING - VALIDATION ERROR + # activeKey structure is completely missing +``` + +**Example - Valid (will be accepted)**: +```yaml +spec: + # identityRef: NOT SET + resources: + - kind: HcpOpenShiftCluster + spec: + properties: + etcd: + dataEncryption: + customerManaged: + kms: + activeKey: + vaultName: "my-vault" + name: "etcd-data-kms-encryption-key" + version: "abc123def456" # ✅ Specified +``` + +#### Layer 2: Controller Runtime Validation (Reconciliation Time) + +**Purpose**: Safety net in case webhook is bypassed or disabled. + +**Implementation**: `exp/controllers/arocontrolplane_reconciler.go` + +**Validation Logic**: +1. During reconciliation, detect encryption configuration +2. If encryption configured AND `identityRef` is NOT set +3. Validate activeKey.version is present in KMS configuration +4. If missing: + - Set `EncryptionKeyReadyCondition` to `False` with reason `KeyVersionMissing` + - Return error to prevent HcpOpenShiftCluster deployment + - Log detailed error message + +**Error Message**: +``` +identityRef is not set and activeKey.version is not specified in HcpOpenShiftCluster - +CAPZ cannot auto-create or propagate the encryption key without Azure credentials. 
+Please manually create the vault and key via ASO and specify the version in +spec.properties.etcd.dataEncryption.customerManaged.kms.activeKey.version +``` + +**Benefits**: +- ✅ Prevents wasted reconciliation cycles +- ✅ Clear, actionable error messages +- ✅ Fails before attempting Azure operations +- ✅ Guides customers to fix the configuration + +### Common Issues and FAQ + +#### Issue: "vault with azureName 'X' not found in AROCluster spec.resources" + +**Cause**: The Vault resource is missing from AROCluster.spec.resources[]. + +**Understanding**: When using `identityRef` with encryption: +- ✅ CAPZ creates the encryption **KEY** inside the vault +- ❌ CAPZ does NOT create the **vault itself** +- ✅ ASO creates the vault (from AROCluster.spec.resources[]) + +**Solution**: Add the Vault resource to **AROCluster.spec.resources[]** (not AROControlPlane): + +```yaml +apiVersion: infrastructure.cluster.x-k8s.io/v1beta1 +kind: AROCluster +metadata: + name: my-cluster +spec: + resources: + - apiVersion: keyvault.azure.com/v1api20230701 + kind: Vault + metadata: + name: my-cluster-kv + namespace: default + spec: + azureName: my-cluster-kv # Must match vaultName in HcpOpenShiftCluster + location: eastus2 + owner: + name: my-resource-group + properties: + enableRbacAuthorization: true + sku: + family: A + name: standard + # ... other resources +``` + +**Additional requirements**: +1. KMS identity needs `Key Vault Crypto Officer` role on the vault +2. Add RoleAssignment in AROCluster.spec.resources[] for the KMS identity + +#### Issue: HcpOpenShiftCluster stuck in "Reconciling" state + +**Common causes**: +1. **Missing `machineCidr`**: Required even with `networkType: "Other"` +2. **Vault not ready**: Wait for vault to be provisioned +3. **Missing role assignments**: KMS identity needs vault permissions +4. 
**Azure provisioning time**: HCP clusters take 15-30 minutes to provision + +**Debugging**: +```bash +# Check encryption key status +kubectl get arocontrolplane <name> -n <namespace> -o jsonpath='{.status.conditions[?(@.type=="EncryptionKeyReady")]}' + +# Check HCP cluster conditions +kubectl get hcpopenshiftclusters -n <namespace> -o yaml | grep -A 20 "conditions:" + +# Check CAPZ logs +kubectl logs -n capz-system deploy/capz-controller-manager --tail=100 +``` + +#### Issue: "identityRef is not set and activeKey.version is not specified" + +**Cause**: Using ASO credential mode without specifying the key version. + +**Solution**: When using ASO credentials (identityRef not set): +1. Create vault in AROCluster.spec.resources[] +2. Manually create the encryption key (Azure Portal/CLI) +3. Specify the key version in HcpOpenShiftCluster: + +```yaml +spec: + properties: + etcd: + dataEncryption: + customerManaged: + kms: + activeKey: + vaultName: my-vault + name: etcd-data-kms-encryption-key + version: "abc123def456" # REQUIRED! +``` + +### Azure Resource Management + +- **ASO (Azure Service Operator)**: Primary method for resource provisioning + - HcpOpenShiftCluster + - HcpOpenShiftClustersNodePool + - HcpOpenShiftClustersExternalAuth + - ResourceGroup, VirtualNetwork, Subnet, NetworkSecurityGroup + - **Vault** (even when identityRef is set!) + - UserAssignedIdentity, RoleAssignment + +- **Azure SDK**: Used only for encryption key management (when identityRef is set) + - KeyVault keys (create/retrieve key versions within existing vault) + - **Note**: This is a temporary gap in ASO coverage. ASO supports Vault CRDs but not key management within vaults.
Upstream tracking: [Azure/azure-service-operator#3188](https://github.com/Azure/azure-service-operator/issues/3188) + - **Future**: Once ASO adds support for key version retrieval, the keyvaults SDK service can be removed entirely + +### Validation and Testing + +#### Validation + +**Webhook Validation** (`exp/api/controlplane/v1beta2/arocontrolplane_webhook.go`): +- Resources mode validation (ensures `spec.resources` is not empty) +- Identity validation (validates `identityRef` if provided) +- Resource structure validation (valid JSON, required fields) +- **Encryption key version validation** (when identityRef is nil and encryption configured) + +**Controller Runtime Validation** (`exp/controllers/arocontrolplane_reconciler.go`): +- Resource readiness checks before creating dependent resources +- **Encryption key version validation** (safety net for webhook bypass) +- Condition setting for validation failures +- Proper error propagation with actionable messages + +**Validation Flow**: +``` +Create/Update AROControlPlane + ↓ +Webhook Validation +├─ Resources mode check +├─ Identity validation +├─ Resource structure +└─ Encryption key version ← FAIL FAST if missing + ↓ +Controller Reconciliation +├─ Runtime validation ← SAFETY NET +├─ Resource dependency checks +└─ Condition updates + ↓ +Resource Deployment +``` + +#### Testing + +- Unit tests for controller reconciliation logic +- Unit tests for webhook validation (including encryption key validation) +- Integration tests for ASO resource creation +- E2E tests for full cluster lifecycle (with and without identityRef) +- Validation of proper dependency sequencing +- Testing of error conditions and recovery +- Testing of both authentication modes (CAPZ managed vs ASO credential-based) + +## Alternatives Considered + +### Using AzureASOManagedControlPlane + +AzureASOManagedControlPlane is designed for AKS managed clusters using `containerservice.azure.com` resources. 
ARO HCP uses fundamentally different Azure resource types (`redhatopenshift.azure.com/HcpOpenShiftCluster`) with ARO-specific requirements including ETCD encryption with Key Vault integration, external authentication resources, and different node pool APIs. The architectural differences are significant enough that separate CRDs provide clearer API boundaries and simpler controller logic than trying to unify both platforms under a single type. + +### Pure ASO Without CAPZ Controllers + +While ASO provides Azure resource management, CAPZ controllers add Cluster API integration (lifecycle management, dependency orchestration, status aggregation) and automation features like encryption key version injection. Users who don't need CAPI integration can use ASO directly with `identityRef: nil` mode. + +## Maintenance and Ownership + +The ARO HCP integration is **owned and maintained by the Azure Red Hat OpenShift (ARO) team at Red Hat**. This includes the AROControlPlane, AROMachinePool, and AROCluster CRDs, controllers, webhooks, and ARO-specific documentation. + +**CAPZ core team responsibilities** are limited to maintaining shared infrastructure (ASO integration patterns, common libraries) and reviewing ARO pull requests for CAPZ architectural alignment. The CAPZ core team is **not expected to own or maintain ARO-specific logic**. + +### Azure SDK Usage and ASO Migration Plan + +The keyvaults service currently uses Azure SDK for key version retrieval because ASO doesn't yet support Key Vault key management CRDs (only Vault resources). This SDK usage is temporary and will be removed once ASO coverage is available. 
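To make "key version retrieval" concrete, the sketch below shows one way a key version could be read from a Key Vault key identifier, which has the form `https://{vault}.vault.azure.net/keys/{key-name}/{key-version}`. It is an illustration under stated assumptions, not the keyvaults service code: `keyVersionFromID` is a hypothetical helper, the proposal does not say how the service obtains the version, and the vault name, key name, and version are the placeholder values used in the YAML examples.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// keyVersionFromID is a hypothetical helper (not part of CAPZ). It splits a
// Key Vault key identifier of the form
// https://{vault}.vault.azure.net/keys/{key-name}/{key-version}
// into the key name and key version.
func keyVersionFromID(keyID string) (string, string, error) {
	u, err := url.Parse(keyID)
	if err != nil {
		return "", "", fmt.Errorf("parsing key identifier %q: %w", keyID, err)
	}
	parts := strings.Split(strings.Trim(u.Path, "/"), "/")
	// A versioned identifier has exactly three path segments: keys/<name>/<version>.
	if len(parts) != 3 || parts[0] != "keys" || parts[1] == "" || parts[2] == "" {
		return "", "", fmt.Errorf("key identifier %q is not of the form /keys/<name>/<version>", keyID)
	}
	return parts[1], parts[2], nil
}

func main() {
	// Placeholder identifier built from the example values in this proposal.
	id := "https://my-vault.vault.azure.net/keys/etcd-data-kms-encryption-key/abc123def456"
	name, version, err := keyVersionFromID(id)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("key=%s version=%s\n", name, version)
}
```

The extracted version is the value that would populate `activeKey.version`. A versionless identifier is rejected because the version is mandatory in that field.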
+ +**ARO HCP API Versions**: +- **2024-06-10-preview**: Private preview API version (currently used in this implementation) +- **2025-12-23-preview**: Public preview API version (planned migration target) + +API specifications are maintained in the [ARO-HCP repository](https://github.com/Azure/ARO-HCP/tree/main/api/redhatopenshift/resource-manager/Microsoft.RedHatOpenShift/hcpclusters/preview). + +**Dependency chain for full ASO migration**: +1. ARO HCP public preview completion (pending Microsoft release formalities) +2. ARO HCP API specification merged into Azure REST API specs repository +3. ASO generation of ARO HCP CRDs from updated specs +4. Backporting ASO changes to the release version used by CAPZ + +Once these dependencies are resolved and ASO supports key management, the keyvaults SDK service will be removed entirely. The ARO team commits to migrating to pure ASO-native implementation at that time. + +## Risks and Mitigations + +- **ASO API Changes**: Monitor ASO releases and update resource definitions accordingly +- **Azure API Changes**: ARO HCP API changes will require ASO updates +- **Security**: Use managed identities and RBAC for Azure resource access +- **Dependency Timing**: Proper condition checking prevents premature resource creation +- **Resource Cleanup**: Owner references ensure cascading deletion of related resources + +## Graduation Criteria + +- ✅ Successful provisioning of ARO HCP clusters using resources mode +- ✅ Proper dependency chain implementation preventing CAPI errors +- ✅ Validation against Azure and Kubernetes integration benchmarks +- ✅ Encryption key management working correctly +- ✅ ExternalAuth configuration functioning properly +- ✅ Clean code with field-based mode removed (~5,400 lines removed) +- ✅ Maintenance ownership documented and agreed upon +- ✅ Plan documented to remove keyvaults SDK service when ASO supports key version retrieval +- ✅ Acknowledgment that CAPZ's long-term direction is toward pure ASO-native 
controllers + +## Implementation History + +- 2025-04-23: Initial proposal +- 2025-04-25: Proposal marked as implementable +- 2026-02-05: Implementation completed with resources mode +- 2026-02-07: Dependency chain refactored for proper sequencing +- 2026-02-08: Field-based provisioning code removed, proposal updated to reflect current implementation +- 2026-02-09: AROControlPlaneSpec field cleanup - removed redundant fields (domainPrefix, version, channelGroup, versionGate, additionalTags) that duplicated ASO resource configuration. Retained only functionally required fields: Resources, IdentityRef, SubscriptionID, AzureEnvironment. This simplifies the API and enforces resources-only mode where all cluster configuration is defined in embedded ASO resources. +- 2026-02-20: **ARO-24514 implementation** - Made `identityRef` optional to support ASO credential-based authentication. Added two-layer validation (webhook + runtime) for encryption key version when identityRef is not set. This enables customers to use ASO credentials without managing separate AzureClusterIdentity, while ensuring proper validation prevents deployment failures. Key changes: + - identityRef is now optional (previously required for Key Vault operations) + - **AROControlPlane**: CAPZ skips Key Vault operations when identityRef is not set + - **AROMachinePool**: CAPZ skips Azure credential initialization when identityRef is not set + - Webhook validates activeKey.version is specified when encryption is configured and identityRef is nil + - Controller performs runtime validation as safety net + - Clear error messages guide customers to manually specify activeKey.version + - Documentation updated with authentication modes and validation requirements +- 2026-02-20: **AROMachinePool refactoring** - Removed Azure SDK dependency and replaced Azure VM API calls with workload cluster node listing (following ASOManagedMachinePool pattern). 
AROMachinePool is now fully ASO-native with no Azure credentials required. Key changes: + - Removed AzureClients and CredentialCache from AROMachinePool scope and reconciler + - Removed virtualMachines service and Azure SDK dependencies + - Added ClusterTracker to get workload cluster client (similar to ASOManagedMachinePool) + - Populates providerIDList from workload cluster nodes instead of Azure VMs + - Lists nodes from workload cluster and filters by node pool name pattern + - More reliable - no dependency on Azure API availability + - Consistent with other ASO-based machine pool implementations +- 2026-02-26: **KeyVault service refactoring and error message improvements** - Simplified KeyVault resource group extraction and improved error messages to prevent customer confusion. Production tested with mv9-stage cluster. Key changes: + - Added `getVaultK8sInfo()` to extract vault metadata (name, namespace, API version) from AROCluster.spec.resources[] + - Added `getVaultResourceGroupFromStatus()` to query deployed Vault and extract resource group from status.id + - Added `parseARMResourceAttribute()` for robust ARM ID parsing using attribute/value pairs + - Removed hardcoded API versions - now dynamically discovered from AROCluster.spec.resources[] + - Simplified KeyVaultScope interface - removed 3 unused methods (ResourceGroup, SubscriptionID, AsyncStatusUpdater) + - Improved error message: "vault with azureName 'X' not found in AROCluster spec.resources - when using encryption with identityRef, the Vault resource must be declared in AROCluster.spec.resources[]" + - Updated API documentation to clarify: "Even when identityRef is set, the Vault resource must be declared in AROCluster.spec.resources[] so ASO can create the vault. CAPZ only creates the encryption KEY inside the existing vault" + - Added Common Issues/FAQ section to proposal with debugging guidance + - Successfully deployed in production (mv9-stage) with K8s name ≠ Azure name (spec.azureName)
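
The attribute/value ARM ID parsing noted in the 2026-02-26 entry can be sketched as follows. This is a minimal illustration, not the CAPZ implementation: `armResourceAttribute` is a hypothetical stand-in for `parseARMResourceAttribute()`, whose real signature is not shown in this proposal, and the example ID uses placeholder values. The idea is to treat the ID as name/value pairs and look up a value by its attribute name rather than by a fixed segment position.

```go
package main

import (
	"fmt"
	"strings"
)

// armResourceAttribute is a hypothetical helper (not the actual CAPZ function).
// It walks an ARM resource ID as consecutive name/value pairs, for example
// "resourceGroups"/"my-resource-group", and returns the value that follows
// the requested attribute name. Names are compared case-insensitively.
func armResourceAttribute(armID, attribute string) (string, error) {
	segments := strings.Split(strings.Trim(armID, "/"), "/")
	for i := 0; i+1 < len(segments); i += 2 {
		if strings.EqualFold(segments[i], attribute) {
			if segments[i+1] == "" {
				return "", fmt.Errorf("attribute %q has an empty value in ARM ID %q", attribute, armID)
			}
			return segments[i+1], nil
		}
	}
	return "", fmt.Errorf("attribute %q not found in ARM ID %q", attribute, armID)
}

func main() {
	// Placeholder ID shaped like a Vault status.id value.
	id := "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/my-resource-group/providers/Microsoft.KeyVault/vaults/my-cluster-kv"
	rg, err := armResourceAttribute(id, "resourceGroups")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(rg)
}
```

Looking up by attribute name keeps the caller independent of how many segments precede the resource group, which is the robustness benefit the entry attributes to the attribute/value approach.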