From 32f355acf14cbb3bcf22c7f718620b8b89d779a9 Mon Sep 17 00:00:00 2001 From: Pavol Loffay Date: Wed, 13 May 2026 17:13:10 +0200 Subject: [PATCH 1/4] Add RFC: Instrumentation v1beta1 Signed-off-by: Pavol Loffay --- docs/rfcs/instrumentation-v1beta1.md | 406 +++++++++++++++++++++++++++ 1 file changed, 406 insertions(+) create mode 100644 docs/rfcs/instrumentation-v1beta1.md diff --git a/docs/rfcs/instrumentation-v1beta1.md b/docs/rfcs/instrumentation-v1beta1.md new file mode 100644 index 0000000000..dd37d38f73 --- /dev/null +++ b/docs/rfcs/instrumentation-v1beta1.md @@ -0,0 +1,406 @@ +# Instrumentation v1beta1 + +This document outlines the next version of the Instrumentation CRD - `v1beta1`. + +## Motivation + +The current `v1alpha1` Instrumentation CRD has been widely adopted in production environments despite its `v1alpha1` version. Promoting to `v1beta1` signals API stability and provides an opportunity to address accumulated design issues. The `v1beta1` version should be a breaking change from `v1alpha1` that: + +1. Aligns with OpenTelemetry's [declarative configuration](https://github.com/open-telemetry/opentelemetry-configuration) initiative +2. Fixes structural inconsistencies in the current API + +## Objectives + +1. Support strongly-typed [declarative configuration](https://github.com/open-telemetry/opentelemetry-configuration) in `spec.declarativeConfig` alongside existing env-var-based configuration in `spec.envConfig` ([#4093](https://github.com/open-telemetry/opentelemetry-operator/issues/4093)) +2. Add explicit OTLP exporter protocol field to avoid ambiguity between HTTP and gRPC endpoints ([#3658](https://github.com/open-telemetry/opentelemetry-operator/issues/3658)) +3. 
Normalize per-language resource fields — remove deprecated [`volumeLimitSize`](https://github.com/open-telemetry/opentelemetry-operator/blob/main/apis/v1alpha1/instrumentation_types.go#L178) and unify inconsistent JSON tags ([`json:"resources"`](https://github.com/open-telemetry/opentelemetry-operator/blob/main/apis/v1alpha1/instrumentation_types.go#L188) vs [`json:"resourceRequirements"`](https://github.com/open-telemetry/opentelemetry-operator/blob/main/apis/v1alpha1/instrumentation_types.go#L228)) +4. Consolidate `Resource` and `Defaults` into a single top-level `spec.resource` field for operator-level resource attribute configuration + +## Non-Goals (for initial v1beta1) + +* Label selectors for targeting workloads ([#2744](https://github.com/open-telemetry/opentelemetry-operator/issues/2744), [#821](https://github.com/open-telemetry/opentelemetry-operator/issues/821)) - additive feature, can be added later without breaking changes +* Webhook architecture separation ([#5010](https://github.com/open-telemetry/opentelemetry-operator/issues/5010), [#4115](https://github.com/open-telemetry/opentelemetry-operator/issues/4115)) - operational concern, not CRD spec +* Windows node support ([#642](https://github.com/open-telemetry/opentelemetry-operator/issues/642)) - can be added without breaking changes +* New language support - can be added incrementally + +## Proposed Changes + +### 1. SDK Declarative Configuration + +**Issues:** [#4093](https://github.com/open-telemetry/opentelemetry-operator/issues/4093), [#4607](https://github.com/open-telemetry/opentelemetry-operator/issues/4607) + +OpenTelemetry is standardizing on [file-based declarative configuration](https://opentelemetry.io/docs/specs/otel/configuration/) as the preferred way to configure SDKs. 
The v1beta1 CRD supports two mutually exclusive configuration approaches: + +- **`spec.declarativeConfig`** - strongly-typed Go structs matching the [OTel SDK configuration schema](https://github.com/open-telemetry/opentelemetry-configuration). The operator serializes this to a YAML file, mounts it into the workload, and sets `OTEL_CONFIG_FILE`. +- **`spec.envConfig`** - the existing env-var-based configuration (`exporter`, `sampler`, `propagators`), moved under a dedicated field. This preserves the current v1alpha1 behavior. Resource attributes are configured separately in the top-level `spec.resource` field (see section 4). + +Setting both `declarativeConfig` and `envConfig` is invalid and rejected by the webhook. + +#### Declarative config example + +```yaml +apiVersion: opentelemetry.io/v1beta1 +kind: Instrumentation +metadata: + name: declarative-example +spec: + declarativeConfig: + file_format: "0.4" + resource: + attributes: + - name: service.namespace + value: production + tracer_provider: + sampler: + parent_based: + root: + trace_id_ratio_based: + ratio: 0.25 + processors: + - batch: {} + exporters: + - otlp: + endpoint: http://collector:4318 + protocol: http/protobuf +``` + +#### Environment variable substitution in declarative config + +The declarative config supports [environment variable substitution](https://opentelemetry.io/docs/specs/otel/configuration/file-configuration/#environment-variable-substitution) using the `${VAR}` syntax. This is useful for injecting secrets like API tokens without hardcoding them in the CR. Environment variables can be set via `spec.env` or per-language `env` fields. + +This also addresses a limitation in v1alpha1 where Kubernetes `$(VAR)` substitution fails due to env var ordering ([#3022](https://github.com/open-telemetry/opentelemetry-operator/issues/3022)). Since `${VAR}` substitution in declarative config happens at SDK runtime rather than Kubernetes pod creation time, it works regardless of the order in which env vars are defined. 
+ +```yaml +apiVersion: opentelemetry.io/v1beta1 +kind: Instrumentation +metadata: + name: declarative-with-secret +spec: + env: + - name: OTEL_EXPORTER_API_KEY + valueFrom: + secretKeyRef: + name: otel-secrets + key: api-key + declarativeConfig: + file_format: "0.4" + tracer_provider: + processors: + - batch: {} + exporters: + - otlp: + endpoint: https://otlp.example.com:4318 + protocol: http/protobuf + headers: + - name: x-api-key + value: ${OTEL_EXPORTER_API_KEY} +``` + +#### Env-var config example (current behavior) + +```yaml +apiVersion: opentelemetry.io/v1beta1 +kind: Instrumentation +metadata: + name: env-example +spec: + envConfig: + exporter: + endpoint: http://collector:4318 + protocol: http/protobuf + sampler: + type: parentbased_traceidratio + argument: "0.25" + propagators: + - tracecontext + - baggage + resource: + attributes: + service.namespace: production +``` + +### 2. Explicit Exporter Protocol + +**Issues:** [#3658](https://github.com/open-telemetry/opentelemetry-operator/issues/3658) + +The v1alpha1 `spec.exporter` struct has a single `endpoint` field with no indication of whether it expects HTTP or gRPC. 
This is a common source of confusion because different SDK auto-instrumentation images default to different protocols: + +| Language | Default OTLP Protocol | Default Port | Operator Override | +|----------|----------------------|--------------|-------------------| +| Java | `http/protobuf` | 4318 | No (SDK default since Java agent 2.x) | +| NodeJS | `http/protobuf` | 4318 | No (SDK default) | +| Python | `grpc` | 4317 | Yes — operator forces `http/protobuf` (port 4318) | +| DotNet | `http/protobuf` | 4318 | No (auto-instrumentation default; differs from .NET SDK which defaults to `grpc`) | +| Go | `http/protobuf` | 4318 | No (auto-instrumentation default; Go SDK itself defaults to `grpc`) | +| Apache HTTPD | `grpc` | 4317 | No (otel-webserver-module only supports gRPC; [proposal to add HTTP](https://github.com/open-telemetry/opentelemetry-cpp-contrib/issues/614)) | +| Nginx | `grpc` | 4317 | No (otel-webserver-module only supports gRPC; [proposal to add HTTP](https://github.com/open-telemetry/opentelemetry-cpp-contrib/issues/614)) | + +The v1beta1 adds an explicit `protocol` field to the `Exporter` struct. When set, the operator injects `OTEL_EXPORTER_OTLP_PROTOCOL` alongside the endpoint. Valid values are `grpc`, `http/protobuf`, and `http/json`. + +```go +type Exporter struct { + Endpoint string `json:"endpoint,omitempty"` + Protocol string `json:"protocol,omitempty"` + TLS *TLS `json:"tls,omitempty"` +} +``` + +### 3. 
Normalize Per-Language Resource Fields + +The current v1alpha1 has inconsistent JSON tags across language structs: +- Java uses [`"resources"`](https://github.com/open-telemetry/opentelemetry-operator/blob/main/apis/v1alpha1/instrumentation_types.go#L188) while NodeJS/Python/DotNet/Go use [`"resourceRequirements"`](https://github.com/open-telemetry/opentelemetry-operator/blob/main/apis/v1alpha1/instrumentation_types.go#L228) +- The deprecated `volumeLimitSize` field exists on all languages + +The v1beta1 normalizes all per-language structs to use a common base: + +```go +// CommonLanguageSpec contains fields shared by all language-specific configurations. +type CommonLanguageSpec struct { + Image string `json:"image,omitempty"` + VolumeClaimTemplate corev1.PersistentVolumeClaimTemplate `json:"volumeClaimTemplate,omitempty"` + Env []corev1.EnvVar `json:"env,omitempty"` + Resources corev1.ResourceRequirements `json:"resources,omitempty"` +} +``` + +Changes from v1alpha1: +- **Removed:** `VolumeSizeLimit` (`volumeLimitSize`) - deprecated in v1alpha1, use `volumeClaimTemplate` instead +- **Renamed:** `resourceRequirements` -> `resources` (consistent across all languages) + +Language-specific extensions remain (Java `extensions`, Go `securityContext`, ApacheHttpd `version`/`configPath`/`attrs`, Nginx `configFile`/`attrs`). + +### 4. Top-Level Resource Configuration + +**Issues:** [#3775](https://github.com/open-telemetry/opentelemetry-operator/issues/3775) + +In v1alpha1, `Resource` (user-defined attributes, `addK8sUIDAttributes`) and `Defaults` (`useLabelsForResourceAttributes`) are separate top-level fields, but both control how the operator populates resource attributes. In v1beta1 these are consolidated into a single top-level `spec.resource` field. + +This field is independent of the SDK configuration mode (`declarativeConfig` vs `envConfig`) because it controls **operator-level injection behavior**, not SDK configuration. 
The operator injects K8s metadata and service identity attributes following the [OTel Semantic Conventions for K8s attributes](https://opentelemetry.io/docs/specs/semconv/non-normative/k8s-attributes/). + +```yaml +apiVersion: opentelemetry.io/v1beta1 +kind: Instrumentation +metadata: + name: resource-example +spec: + resource: + # User-defined resource attributes injected into workloads + attributes: + deployment.environment.name: production + # K8s resource attributes (k8s.pod.name, k8s.namespace.name, k8s.deployment.name, etc.) + # See: https://opentelemetry.io/docs/specs/semconv/non-normative/k8s-attributes/ + k8sMetadata: + # Set to false to disable K8s resource attribute injection. Defaults to true. + enabled: true + # Include K8s UID attributes (k8s.deployment.uid, k8s.replicaset.uid, etc.) + includeUIDs: true + # Service identity attributes (service.name, service.version, service.namespace, service.instance.id) + # Derived from K8s metadata following OTel semantic conventions precedence. + # See: https://opentelemetry.io/docs/specs/semconv/non-normative/k8s-attributes/ + serviceMetadata: + # Set to false to disable automatic service attribute derivation. Defaults to true. + enabled: true + envConfig: + exporter: + endpoint: http://collector:4318 +``` + +#### Injection behavior per config mode + +When using `declarativeConfig`, the operator mounts a YAML config file and sets `OTEL_CONFIG_FILE` to point to it. In this mode, [all other OTel environment variables are ignored by the SDK](https://opentelemetry.io/docs/languages/sdk-configuration/declarative-configuration/) unless explicitly referenced via `${VAR}` substitution syntax in the config file. This means `OTEL_RESOURCE_ATTRIBUTES` would not work alongside declarative config. + +The operator handles this differently depending on the active config mode: + +- `envConfig` mode — the operator sets `OTEL_RESOURCE_ATTRIBUTES` env var with the computed attributes (current v1alpha1 behavior). 
+- `declarativeConfig` mode - the operator merges the computed attributes directly into the `resource.attributes` list in the serialized YAML config file before mounting it into the workload. No `OTEL_RESOURCE_ATTRIBUTES` env var is needed. + +In both cases, attributes are merged in increasing order of precedence: user-defined attributes from `spec.resource.attributes` (lowest), then K8s metadata, then pod annotations (`resource.opentelemetry.io/*`). + +## CRD Spec + +Full proposed v1beta1 `InstrumentationSpec`: + +```go +type InstrumentationSpec struct { + // DeclarativeConfig defines the OTel SDK configuration as strongly-typed fields + // matching the OTel declarative configuration schema. + // The operator serializes this to a YAML file and mounts it into the workload. + // Mutually exclusive with EnvConfig. + // +optional + DeclarativeConfig *DeclarativeConfig `json:"declarativeConfig,omitempty"` + + // EnvConfig defines the SDK configuration via environment variables. + // This is the same configuration model as v1alpha1 (exporter, sampler, propagators). + // Mutually exclusive with DeclarativeConfig. + // +optional + EnvConfig *EnvConfig `json:"envConfig,omitempty"` + + // Resource defines operator-level resource attribute configuration. + // These settings control how the operator populates resource attributes + // and apply regardless of whether declarativeConfig or envConfig is used. + // +optional + Resource Resource `json:"resource,omitempty"` + + // Env defines common env vars. + // Precedence: original container env > language-specific env > common env > SDK config. + // +optional + Env []corev1.EnvVar `json:"env,omitempty"` + + // Java defines configuration for Java auto-instrumentation. + // +optional + Java Java `json:"java,omitempty"` + + // NodeJS defines configuration for NodeJS auto-instrumentation. + // +optional + NodeJS NodeJS `json:"nodejs,omitempty"` + + // Python defines configuration for Python auto-instrumentation. 
+ // +optional + Python Python `json:"python,omitempty"` + + // DotNet defines configuration for DotNet auto-instrumentation. + // +optional + DotNet DotNet `json:"dotnet,omitempty"` + + // Go defines configuration for Go auto-instrumentation. + // +optional + Go Go `json:"go,omitempty"` + + // ApacheHttpd defines configuration for Apache HTTPD auto-instrumentation. + // +optional + ApacheHttpd ApacheHttpd `json:"apacheHttpd,omitempty"` + + // Nginx defines configuration for Nginx auto-instrumentation. + // +optional + Nginx Nginx `json:"nginx,omitempty"` + + // ImagePullPolicy defines the image pull policy for init containers. + // +optional + ImagePullPolicy corev1.PullPolicy `json:"imagePullPolicy,omitempty"` + + // InitContainerSecurityContext applied to auto-instrumentation init containers. + // +optional + InitContainerSecurityContext *corev1.SecurityContext `json:"initContainerSecurityContext,omitempty"` +} + +// DeclarativeConfig mirrors the OTel SDK configuration schema as strongly-typed Go structs. +// See https://github.com/open-telemetry/opentelemetry-configuration for the full schema. +// The exact struct definitions will be generated from or aligned with the upstream schema. +type DeclarativeConfig struct { + // FileFormat is the OTel configuration schema version (e.g. "0.4"). + FileFormat string `json:"file_format"` + + // Disabled controls whether the SDK is disabled. + // +optional + Disabled *bool `json:"disabled,omitempty"` + + // Resource defines resource attributes configuration. + // +optional + Resource *ResourceConfig `json:"resource,omitempty"` + + // Propagator defines context propagation configuration. + // +optional + Propagator *PropagatorConfig `json:"propagator,omitempty"` + + // TracerProvider defines tracer provider configuration (samplers, processors, exporters). + // +optional + TracerProvider *TracerProviderConfig `json:"tracer_provider,omitempty"` + + // MeterProvider defines meter provider configuration (readers, views). 
+ // +optional + MeterProvider *MeterProviderConfig `json:"meter_provider,omitempty"` + + // LoggerProvider defines logger provider configuration (processors, exporters). + // +optional + LoggerProvider *LoggerProviderConfig `json:"logger_provider,omitempty"` +} + +// EnvConfig defines the env-var-based SDK configuration (same as v1alpha1 top-level fields). +type EnvConfig struct { + // Exporter defines exporter configuration. + // +optional + Exporter Exporter `json:"exporter,omitempty"` + + // Propagators defines inter-process context propagation configuration. + // +optional + Propagators []Propagator `json:"propagators,omitempty"` + + // Sampler defines sampling configuration. + // +optional + Sampler Sampler `json:"sampler,omitempty"` +} + +// Resource defines operator-level resource attribute configuration. +// These fields control how the operator populates resource attributes and +// are independent of the SDK configuration mode (declarativeConfig vs envConfig). +// See: https://opentelemetry.io/docs/specs/semconv/non-normative/k8s-attributes/ +type Resource struct { + // Attributes defines resource attributes to inject into the workload. + // +optional + Attributes map[string]string `json:"attributes,omitempty"` + + // K8sMetadata controls K8s resource attribute injection (k8s.pod.name, k8s.namespace.name, etc.). + // +optional + K8sMetadata *K8sMetadataConfig `json:"k8sMetadata,omitempty"` + + // ServiceMetadata controls service identity attribute derivation (service.name, service.version, etc.). + // +optional + ServiceMetadata *ServiceMetadataConfig `json:"serviceMetadata,omitempty"` +} + +// K8sMetadataConfig defines how Kubernetes resource attributes are injected. +// Controls attributes like k8s.pod.name, k8s.namespace.name, k8s.deployment.name, k8s.node.name, etc. +type K8sMetadataConfig struct { + // Enabled controls whether K8s resource attributes are automatically injected. + // When false, no k8s.* attributes are added. Defaults to true. 
+ // +optional + Enabled *bool `json:"enabled,omitempty"` + + // IncludeUIDs defines whether K8s UID attributes should be collected + // (e.g. k8s.deployment.uid, k8s.replicaset.uid). Only applies when Enabled is true. + // +optional + IncludeUIDs bool `json:"includeUIDs,omitempty"` +} + +// ServiceMetadataConfig defines how service identity attributes are derived from K8s metadata. +// Controls attributes: service.name, service.version, service.namespace, service.instance.id. +// Follows OTel semantic conventions precedence: https://opentelemetry.io/docs/specs/semconv/non-normative/k8s-attributes/ +type ServiceMetadataConfig struct { + // Enabled controls whether service identity attributes are automatically derived. + // When false, no service.* attributes are added by the operator. Defaults to true. + // +optional + Enabled *bool `json:"enabled,omitempty"` +} +``` + +The exact child types for `DeclarativeConfig` (`ResourceConfig`, `TracerProviderConfig`, etc.) will be defined to match the [OTel configuration schema](https://github.com/open-telemetry/opentelemetry-configuration). + +The `EnvConfig` types (`Exporter`, `Sampler`, `Propagator`) are the same as v1alpha1, just moved under `spec.envConfig`. The `Exporter` type gains a new `protocol` field. The `Resource` type is promoted to the top level of `InstrumentationSpec` since it controls operator-level injection behavior that applies to both config modes. 
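
The two rules stated in the struct comments above (the config modes are mutually exclusive, and a nil `Enabled` pointer means "enabled") could be checked roughly as follows. This is only a sketch: the types are trimmed stand-ins for the proposed ones, and `validateConfigMode` and `k8sMetadataEnabled` are hypothetical helper names, not part of the RFC or the operator.

```go
package main

import (
	"errors"
	"fmt"
)

// Trimmed stand-ins for the proposed v1beta1 types (assumption: only the
// fields needed for this illustration are kept).
type DeclarativeConfig struct{ FileFormat string }
type EnvConfig struct{}

type K8sMetadataConfig struct {
	Enabled     *bool
	IncludeUIDs bool
}

type InstrumentationSpec struct {
	DeclarativeConfig *DeclarativeConfig
	EnvConfig         *EnvConfig
	K8sMetadata       *K8sMetadataConfig
}

// validateConfigMode (hypothetical) rejects specs that set both config modes.
func validateConfigMode(spec InstrumentationSpec) error {
	if spec.DeclarativeConfig != nil && spec.EnvConfig != nil {
		return errors.New("spec.declarativeConfig and spec.envConfig are mutually exclusive")
	}
	return nil
}

// k8sMetadataEnabled (hypothetical) applies the documented default:
// an unset block or an unset Enabled pointer means true.
func k8sMetadataEnabled(cfg *K8sMetadataConfig) bool {
	if cfg == nil || cfg.Enabled == nil {
		return true
	}
	return *cfg.Enabled
}

func main() {
	both := InstrumentationSpec{
		DeclarativeConfig: &DeclarativeConfig{FileFormat: "0.4"},
		EnvConfig:         &EnvConfig{},
	}
	fmt.Println("both modes set:", validateConfigMode(both))

	off := false
	fmt.Println("unset block enabled:", k8sMetadataEnabled(nil))
	fmt.Println("explicitly disabled:", k8sMetadataEnabled(&K8sMetadataConfig{Enabled: &off}))
}
```

Using `*bool` for `Enabled` is what lets an omitted field be told apart from an explicit `false`, which a plain `bool` with `omitempty` cannot express.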
+ +## Breaking Changes from v1alpha1 + +| Change | v1alpha1 | v1beta1 | Migration | +|--------|----------|---------|-----------| +| SDK configuration | `exporter`, `sampler`, `propagators`, `resource` at top level | `exporter`, `sampler`, `propagators` moved under `spec.envConfig`, or use new `spec.declarativeConfig` | Wrap existing fields under `envConfig`, or migrate to declarative config | +| Resource attributes | `spec.resource.resourceAttributes`, `spec.resource.addK8sUIDAttributes`, `spec.defaults.useLabelsForResourceAttributes` | Consolidated into `spec.resource.attributes`, `spec.resource.k8sMetadata`, and `spec.resource.serviceMetadata` | Rename `resourceAttributes` to `attributes`, use `k8sMetadata` for k8s.* attributes, use `serviceMetadata` for service.* derivation | +| Per-language resources JSON tag | Mixed (`resources` / `resourceRequirements`) | `resources` (all) | Rename in YAML for NodeJS, Python, DotNet, Go, ApacheHttpd, Nginx | +| Volume size limit | `volumeLimitSize` (deprecated) | Removed | Use `volumeClaimTemplate` | + +## Migration Strategy + +1. **Conversion webhook**: Implement a conversion webhook that translates between v1alpha1 and v1beta1, handling field renames and removals automatically. +2. **Dual-version support**: Serve both v1alpha1 and v1beta1 simultaneously with v1beta1 as the storage version. +3. **Deprecation timeline**: v1alpha1 is deprecated when v1beta1 ships. 
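
To make the conversion step concrete, the field moves from the Breaking Changes table could be expressed as a pure function like the one below. It is a minimal sketch under stated assumptions: the structs are flattened, hypothetical stand-ins rather than the real API types, `convertToV1Beta1` is an illustrative name, and mapping `addK8sUIDAttributes` to `k8sMetadata.includeUIDs` is an assumed equivalence. The mapping for `useLabelsForResourceAttributes` is left out because this RFC does not spell it out.

```go
package main

import "fmt"

// V1Alpha1Spec is a flattened, hypothetical stand-in for the v1alpha1 spec.
type V1Alpha1Spec struct {
	ExporterEndpoint    string            // spec.exporter.endpoint
	Propagators         []string          // spec.propagators
	ResourceAttributes  map[string]string // spec.resource.resourceAttributes
	AddK8sUIDAttributes bool              // spec.resource.addK8sUIDAttributes
}

// V1Beta1EnvConfig stands in for spec.envConfig.
type V1Beta1EnvConfig struct {
	ExporterEndpoint string
	Propagators      []string
}

// V1Beta1Resource stands in for the top-level spec.resource.
type V1Beta1Resource struct {
	Attributes  map[string]string // spec.resource.attributes
	IncludeUIDs bool              // spec.resource.k8sMetadata.includeUIDs (assumed mapping)
}

// V1Beta1Spec is a flattened, hypothetical stand-in for the v1beta1 spec.
type V1Beta1Spec struct {
	EnvConfig *V1Beta1EnvConfig
	Resource  V1Beta1Resource
}

// convertToV1Beta1 wraps the SDK settings under envConfig and renames the
// resource fields. Maps and slices are copied so the input is not aliased.
func convertToV1Beta1(in V1Alpha1Spec) V1Beta1Spec {
	attrs := make(map[string]string, len(in.ResourceAttributes))
	for k, v := range in.ResourceAttributes {
		attrs[k] = v
	}
	return V1Beta1Spec{
		EnvConfig: &V1Beta1EnvConfig{
			ExporterEndpoint: in.ExporterEndpoint,
			Propagators:      append([]string(nil), in.Propagators...),
		},
		Resource: V1Beta1Resource{
			Attributes:  attrs,
			IncludeUIDs: in.AddK8sUIDAttributes,
		},
	}
}

func main() {
	old := V1Alpha1Spec{
		ExporterEndpoint:    "http://collector:4318",
		Propagators:         []string{"tracecontext", "baggage"},
		ResourceAttributes:  map[string]string{"service.namespace": "production"},
		AddK8sUIDAttributes: true,
	}
	converted := convertToV1Beta1(old)
	fmt.Println(converted.EnvConfig.ExporterEndpoint, converted.Resource.Attributes, converted.Resource.IncludeUIDs)
}
```

A real conversion webhook also has to handle the reverse direction and round-trip any fields that exist in only one version.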
+ +## Related Issues + +| Category | Issues | +|----------|--------| +| Label selectors (future) | [#2744](https://github.com/open-telemetry/opentelemetry-operator/issues/2744), [#821](https://github.com/open-telemetry/opentelemetry-operator/issues/821), [#4445](https://github.com/open-telemetry/opentelemetry-operator/issues/4445) | +| Declarative config | [#4093](https://github.com/open-telemetry/opentelemetry-operator/issues/4093), [#4607](https://github.com/open-telemetry/opentelemetry-operator/issues/4607) | +| Exporter improvements | [#3658](https://github.com/open-telemetry/opentelemetry-operator/issues/3658), [#3390](https://github.com/open-telemetry/opentelemetry-operator/issues/3390), [#2180](https://github.com/open-telemetry/opentelemetry-operator/issues/2180) | +| Env handling | [#3022](https://github.com/open-telemetry/opentelemetry-operator/issues/3022), [#3775](https://github.com/open-telemetry/opentelemetry-operator/issues/3775), [#4559](https://github.com/open-telemetry/opentelemetry-operator/issues/4559) | +| Security context | [#2272](https://github.com/open-telemetry/opentelemetry-operator/issues/2272), [#2053](https://github.com/open-telemetry/opentelemetry-operator/issues/2053) | +| API stability | [#5060](https://github.com/open-telemetry/opentelemetry-operator/issues/5060) | +| Resource attributes | [#3775](https://github.com/open-telemetry/opentelemetry-operator/issues/3775), [#938](https://github.com/open-telemetry/opentelemetry-operator/issues/938) | +| Rollout on change | [#553](https://github.com/open-telemetry/opentelemetry-operator/issues/553) | From e780a1b0a9ef260e46e531ebdd7de9d6a7377fc1 Mon Sep 17 00:00:00 2001 From: Pavol Loffay Date: Wed, 20 May 2026 17:11:57 +0200 Subject: [PATCH 2/4] Some review comments Signed-off-by: Pavol Loffay --- docs/rfcs/instrumentation-v1beta1.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/docs/rfcs/instrumentation-v1beta1.md b/docs/rfcs/instrumentation-v1beta1.md index 
dd37d38f73..1940d12d07 100644 --- a/docs/rfcs/instrumentation-v1beta1.md +++ b/docs/rfcs/instrumentation-v1beta1.md @@ -36,6 +36,20 @@ OpenTelemetry is standardizing on [file-based declarative configuration](https:/ Setting both `declarativeConfig` and `envConfig` is invalid and rejected by the webhook. +#### Language support + +Not all language SDKs support declarative configuration yet. See [language-support-status.md](https://github.com/open-telemetry/opentelemetry-configuration/blob/main/language-support-status.md) for the current status. + +- **Java** — supported +- **Python** — supported ([#4856](https://github.com/open-telemetry/opentelemetry-python/issues/4856)) +- **Node.js** — in progress +- **Go** — in progress +- **.NET** — not supported ([#6380](https://github.com/open-telemetry/opentelemetry-dotnet/issues/6380)) + +The operator will **skip injection** for languages that do not support declarative configuration when `spec.declarativeConfig` is set. For example, using `declarativeConfig` with `.NET` auto-instrumentation will result in the pod mutation webhook skipping injection and emitting a warning event. + +The operator will have a flag to control declarative configuration support (e.g., `--instrumentation-declarative-config=java,python,dotnet,cpp,nodejs`). This allows enabling support as SDKs mature without requiring operator upgrades. 
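
As an illustration of the skip-injection rule and the flag described above, the decision could be sketched as below. The helper names `parseDeclarativeLanguages` and `shouldInject` are hypothetical, and the comma-separated flag format is taken only from the example value shown in this section.

```go
package main

import (
	"fmt"
	"strings"
)

// parseDeclarativeLanguages (hypothetical) turns the comma-separated flag
// value into a set of lower-cased language names.
func parseDeclarativeLanguages(flagValue string) map[string]bool {
	enabled := make(map[string]bool)
	for _, lang := range strings.Split(flagValue, ",") {
		lang = strings.ToLower(strings.TrimSpace(lang))
		if lang != "" {
			enabled[lang] = true
		}
	}
	return enabled
}

// shouldInject (hypothetical) reports whether injection proceeds for a language.
// Without declarativeConfig every language is injected as before; with it,
// only languages enabled through the flag are injected.
func shouldInject(lang string, declarativeConfigSet bool, enabled map[string]bool) bool {
	if !declarativeConfigSet {
		return true
	}
	return enabled[strings.ToLower(lang)]
}

func main() {
	enabled := parseDeclarativeLanguages("java,python")
	for _, lang := range []string{"java", "dotnet"} {
		fmt.Printf("%s: inject=%v\n", lang, shouldInject(lang, true, enabled))
	}
}
```

With `java,python` enabled, the sketch injects Java and skips .NET. In the operator, a skip would additionally emit the warning event mentioned above.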
+ #### Declarative config example ```yaml From 73494618d720a5930808842e2c38c05c6e8ae236 Mon Sep 17 00:00:00 2001 From: Pavol Loffay Date: Thu, 21 May 2026 17:12:09 +0200 Subject: [PATCH 3/4] Add CRD versioning explanation Signed-off-by: Pavol Loffay --- docs/rfcs/multiple-crd-versions.md | 196 +++++++++++++++++++++++++++++ 1 file changed, 196 insertions(+) create mode 100644 docs/rfcs/multiple-crd-versions.md diff --git a/docs/rfcs/multiple-crd-versions.md b/docs/rfcs/multiple-crd-versions.md new file mode 100644 index 0000000000..c1c0ce4f88 --- /dev/null +++ b/docs/rfcs/multiple-crd-versions.md @@ -0,0 +1,196 @@ +# Multiple CRD Versions + +This document outlines how Kubernetes handles multiple CRD versions, strategies for operator maintainers, and lessons learned from other projects. + +## Background + +When graduating a CRD from `v1alpha1` to `v1beta1` (or `v1beta1` to `v1`), operators face a choice: how to handle the transition for existing users? Kubernetes supports serving multiple versions of the same CRD simultaneously, but this comes with complexity. + +## Kubernetes CRD Versioning Basics + +### Storage Version + +Only one version can be the **storage version** — the version persisted in etcd. All other versions are converted to/from this version. + +```yaml +apiVersion: apiextensions.k8s.io/v1 +kind: CustomResourceDefinition +spec: + versions: + - name: v1alpha1 + served: true + storage: false # Not stored, converted from v1beta1 + - name: v1beta1 + served: true + storage: true # Stored in etcd +``` + +### Served Versions + +The `served` field controls whether the API server accepts requests for that version. + +**When `served: true`:** +- Clients can create, read, update, and delete resources using that version (e.g. 
`instrumentations.v1alpha1.opentelemetry.io`) +- Resources are auto-converted to/from the storage version + +**When `served: false`:** +- API server returns 404 for that version's endpoint +- `kubectl get instrumentations.v1alpha1.opentelemetry.io` fails +- Existing resources in etcd are still accessible via served versions — with `strategy: None`, the API server just swaps the `apiVersion` field (requires identical schemas) +- New resources cannot be created using that version + +### Conversion Strategies + +| Strategy | When to Use | +|----------|-------------| +| `None` | Schemas are identical (only apiVersion differs) | +| `Webhook` | Schemas differ (field renames, restructuring, removals) | + +If schemas differ and you use `None`, data won't map correctly between versions: +- **Renamed fields**: `foo` in `v1alpha1` won't appear in `bar` in `v1beta1` — appears empty +- **Restructured fields**: `spec.exporter.endpoint` won't map to `spec.envConfig.exporter.endpoint` +- **Removed fields**: Data preserved in etcd but invisible in new schema + +### How Controllers Handle Multiple Versions + + +**Example: No conversion (identical schemas)** + +```yaml +apiVersion: apiextensions.k8s.io/v1 +kind: CustomResourceDefinition +spec: + conversion: + strategy: None +``` + +**Example: Webhook conversion** + +```yaml +apiVersion: apiextensions.k8s.io/v1 +kind: CustomResourceDefinition +spec: + conversion: + strategy: Webhook + webhook: + conversionReviewVersions: ["v1"] # ConversionReview API versions the webhook accepts (not CRD versions) + clientConfig: + service: + namespace: opentelemetry-operator-system + name: opentelemetry-operator-webhook + path: /convert +``` + +## Strategy 1: Conversion Webhook + +Implement a webhook that converts between versions automatically. 
+ +### Pros + +- **Seamless migration**: Users can continue using old version, resources auto-convert +- **No forced migration**: Users upgrade at their own pace +- **Backwards compatible**: Old tools/scripts continue working + +### Cons + +- **Deployment complexity**: Webhook requires TLS certificates, secrets, and firewall rules on GKE private clusters (default firewall only allows ports 443/10250 from control plane to nodes — webhooks on other ports require custom firewall rules) +- **Maintenance burden**: Must maintain conversion logic +- **Helm complexity**: Webhooks need complex orchestration in Helm charts (OpenTelemetry operator solved this with templated CRDs — see [UPGRADING.md](https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-operator/UPGRADING.md)) + +### OpenTelemetry Collector `v1alpha1` → `v1beta1` Experience + +The OpenTelemetry Operator implemented a conversion webhook for OpenTelemetryCollector v1alpha1 → v1beta1. Key issues encountered: + +**Helm chart complications:** +- Webhook service name must be templated for custom Helm release names ([helm-charts#1167](https://github.com/open-telemetry/opentelemetry-helm-charts/issues/1167)) +- Users get "service opentelemetry-operator-webhook not found" errors ([helm-charts#1199](https://github.com/open-telemetry/opentelemetry-helm-charts/issues/1199)) + +*Why this happens:* CRDs are cluster-scoped with hardcoded webhook service references, but Helm prefixes resource names with the release name (e.g., `helm install my-otel ...` creates `my-otel-opentelemetry-operator-webhook`). The CRD references `opentelemetry-operator-webhook`, but the actual service has a different name. + +**OLM install mode restriction:** +- Only `AllNamespaces` install mode supported (operator watches all namespaces) — CRDs are cluster-scoped, so conversion webhooks must handle resources from all namespaces, incompatible with `OwnNamespace` mode. 
OLM v1 is moving away from install modes entirely, but the fundamental constraint remains: conversion webhooks are cluster-scoped. + +## Strategy 2: Identical Schemas + +Make all breaking changes while still in alpha, then graduate with identical schemas. + +When using `strategy: None`, no separate controllers are needed per version: + +1. User creates resource using any served version (e.g., `v1alpha1`) +2. API server converts to storage version by changing `apiVersion` field +3. Resource is persisted in etcd as storage version (e.g., `v1beta1`) +4. Controller watches only the storage version using a single Go struct +5. When user reads with old version, API server converts back on the fly + +The operator code remains unchanged — it reconciles only the storage version. The API server handles all version transformations transparently. + +### Approach + +1. Make all breaking changes in `v1alpha1` while it's still alpha (breaking changes are expected) +2. When schema is finalized, graduate to `v1beta1` with identical schema +3. Use conversion strategy `None` — only `apiVersion` changes +4. No conversion webhook needed + +### Pros + +- No conversion webhook complexity +- No maintenance burden for conversion logic +- Clear expectations — both versions behave identically +- Simple Helm/deployment — no webhook TLS/firewall concerns + +### Cons + +- **Breaking changes in alpha** — users on v1alpha1 must update their manifests +- **No automatic migration** — users must manually update `apiVersion` + +## Cert-Manager `cmctl` Approach + +Cert-manager used conversion webhooks for their core CRDs (`Certificate`, `Issuer`, `ClusterIssuer`, `CertificateRequest`) during the transition period while multiple versions were served. 
They had breaking changes between versions: + +- **API group rename**: `certmanager.k8s.io` → `cert-manager.io` +- **Field removals**: `certificate.spec.acme`, `issuer.spec.http01`, `issuer.spec.dns01` +- **Field restructuring**: challenge solver configuration moved to new location + +In addition to the runtime conversion webhook, they provide `cmctl convert` — an offline CLI tool for migrating stored manifests before upgrading. + +**Version progression:** `v1alpha2` → `v1alpha3` → `v1beta1` → `v1` + +| cert-manager | Storage | Served | Notes | +|--------------|---------|--------|-------| +| v1.0 - v1.3 | `v1` | `v1`, `v1beta1`, `v1alpha3`, `v1alpha2` | All versions served | +| v1.4 - v1.5 | `v1` | `v1`, `v1beta1`, `v1alpha3`, `v1alpha2` | Old versions deprecated | +| v1.6 | `v1` | `v1` only | Old versions no longer served | +| v1.7+ | `v1` | `v1` only | Old versions removed from CRD | + +### How It Works + +```bash +# Convert a single file +cmctl convert -f old-certificate.yaml > new-certificate.yaml + +# Convert and apply directly +cmctl convert -f old-certificate.yaml | kubectl apply -f - + +# Convert entire directory +cmctl convert -f ./manifests/ --output-dir ./converted/ +``` + +The tool: +1. Parses input YAML with old API version +2. Maps old fields to new field names/locations +3. Applies defaults for new required fields +4. 
Outputs valid YAML for the new API version + +### References + +- [Migrating Deprecated API Resources](https://cert-manager.io/docs/releases/upgrading/remove-deprecated-apis/) — official migration guide +- [Upgrading from v0.16 to v1.0](https://cert-manager.io/docs/installation/upgrading/upgrading-0.16-1.0/) — major version upgrade guide +- [Issue #4686: Make cmctl upgrade old API versions](https://github.com/cert-manager/cert-manager/issues/4686) — discussion on migration tooling + +## References + +- [Kubernetes CRD Versioning](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/) +- [Prometheus Operator ScrapeConfig Graduation](https://prometheus-operator.dev/docs/proposals/accepted/scrapeconfig-graduation/) +- [Prometheus Operator AlertmanagerConfig v1beta1 Issue](https://github.com/prometheus-operator/prometheus-operator/issues/4677) +- [Helm Charts v1beta1 Missing Issue](https://github.com/prometheus-community/helm-charts/issues/5168) +- [Kubernetes Bug: Conversion for Unserved Versions](https://github.com/kubernetes/kubernetes/issues/129979) From 1f976ebe05099c0ea89227c12e090a79ee54118c Mon Sep 17 00:00:00 2001 From: Pavol Loffay Date: Thu, 21 May 2026 17:24:01 +0200 Subject: [PATCH 4/4] Add CRD versioning explanation Signed-off-by: Pavol Loffay --- docs/rfcs/multiple-crd-versions.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/docs/rfcs/multiple-crd-versions.md b/docs/rfcs/multiple-crd-versions.md index c1c0ce4f88..63b7d82550 100644 --- a/docs/rfcs/multiple-crd-versions.md +++ b/docs/rfcs/multiple-crd-versions.md @@ -10,7 +10,7 @@ When graduating a CRD from `v1alpha1` to `v1beta1` (or `v1beta1` to `v1`), opera ### Storage Version -Only one version can be the **storage version** — the version persisted in etcd. All other versions are converted to/from this version. +Only one version can be the **storage version** - the version persisted in etcd. 
All other versions are converted to/from this version. ```yaml apiVersion: apiextensions.k8s.io/v1 @@ -36,7 +36,7 @@ The `served` field controls whether the API server accepts requests for that ver **When `served: false`:** - API server returns 404 for that version's endpoint - `kubectl get instrumentations.v1alpha1.opentelemetry.io` fails -- Existing resources in etcd are still accessible via served versions — with `strategy: None`, the API server just swaps the `apiVersion` field (requires identical schemas) +- Existing resources in etcd are still accessible via served versions - with `strategy: None`, the API server just swaps the `apiVersion` field (requires identical schemas) - New resources cannot be created using that version ### Conversion Strategies @@ -51,8 +51,7 @@ If schemas differ and you use `None`, data won't map correctly between versions: - **Restructured fields**: `spec.exporter.endpoint` won't map to `spec.envConfig.exporter.endpoint` - **Removed fields**: Data preserved in etcd but invisible in new schema -### How Controllers Handle Multiple Versions - +#### CRD conversions examples **Example: No conversion (identical schemas)**