TelemetryPolicy proposal #69
Changes from 15 commits
6bf817c
972f464
40e812f
644b87f
26b328a
af83a0d
c5aed7d
167372c
d402594
71f4338
4943c76
eb826bc
38d78dc
04667c7
7589829
297b6b3
1a39810
d1f0564
9fa2624
35eb945
8600d0b
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,317 @@ | ||
| Date: 9th February 2026<br/> | ||
| Authors: gkhom<br/> | ||
| Status: draft<br/> | ||
|
|
||
| # TelemetryPolicy | ||
| A Kubernetes API for Gateway/Mesh Observability | ||
|
|
||
| ## Summary | ||
| This proposal introduces the `TelemetryPolicy`, a direct policy attachment designed to configure observability signals (metrics, logs, traces) | ||
| for Gateway API resources (via `Gateway` attachment) and Service Mesh resources (via `namespace` attachment). | ||
|
|
||
| This K8s API standardizes how users enable and configure telemetry across different data plane implementations, replacing vendor-specific CRDs | ||
|
Contributor
I am (acting as) a naive reader, and I was immediately curious what some examples of these vendor-specific CRDs are. This also ties to, and might clarify, the mention of "Observability Lock-in" below.
Author
Examples of such CRDs are:
I intend to write a section that compares such existing APIs and the proposed TelemetryPolicy in the eventual proposal.
Contributor
Seeing as there's a mix of examples here, will the scope cover one resource for all of the signals (metrics, logs, traces), or separate ones? Are there tradeoffs to consider here?
Author
I'm leaning towards one resource for all. The argument for splitting them might be that different personas are involved in configuring the different aspects of observability. In practice, I think that the persona that configures metrics likely also configures tracing and access logs. So, to avoid complicating the API with three additional resources, it seems worthwhile to put all of it in a single resource.
||
| with a unified, portable spec. | ||
|
Contributor
Would you see implementations reconciling the TelemetryPolicy and reading the bits that are relevant to their components? So multiple controllers read the CR and take actions to enable telemetry across the components they are controlling?
Author
It is indeed possible to distribute the responsibility across multiple controllers; it's up to the implementation. In most cases that I'm familiar with, a single controller/control plane programs all three observability features (metrics, traces, logs).
Member
Possible, but a little challenging. In what cases do we see multiple implementations reconcile the same thing?
||
|
|
||
| # Context | ||
| ## The Fragmentation of Observability | ||
| In the current Kubernetes landscape, the “Who, What, Where, and How Long” of network traffic is answered differently depending on the underlying | ||
| proxy technology. While the Gateway API specification has unified how traffic is routed via `HTTPRoute` and `Gateway`, it has deferred the standardization | ||
| of how that traffic is observed. | ||
| This deferral has led to "Observability Lock-in". Platform Engineering teams are forced to learn and manage distinct APIs for each environment. | ||
| A standardized `TelemetryPolicy` is necessary to decouple the intent of observability from the implementation. Without such standardization it is | ||
| difficult for platform owners to: | ||
|
|
||
| 1. Enforce consistent auditing standards across different infrastructure providers. | ||
| 2. Support emerging workloads like AI Agents, which require specialized metrics (e.g., token usage, model latency) and detailed audit logs for tool-use verification. | ||
| 3. Manage “Mesh” and “Gateway” observability with a single unified API. | ||
|
|
||
| ## The Emergence of Agentic Networking | ||
|
Member
Do we really need a callout for this here, given that the rest of the document just specifies a need for Telemetry API standardization, regardless of whether the traffic is Agentic Networking or not?
||
|
|
||
| The most pressing driver for this proposal is the shift in traffic patterns introduced by agentic workloads. We are moving from a deterministic Service-to-Service | ||
| paradigm to a non-deterministic Agent-to-Tool and Agent-to-Agent paradigm. | ||
|
|
||
| In an Agentic Mesh: | ||
| * **Entities are Autonomous**: An AI Agent (Pod) decides entirely on its own to call a Tool (Service). | ||
| * **Cost is Volatile**: Usage is measured in tokens, not just requests. A single HTTP 200 OK could cost $0.01 or $10.00 depending on the prompt and model used. | ||
| * **Context is King**: Debugging requires knowing the semantic context: Which Model? Which Prompt? Which tool? | ||
|
|
||
| Existing telemetry policies are unaware of the emerging Generative AI semantic conventions. They see an opaque TCP stream or HTTP POST. Without a standardized API to | ||
| configure the extraction and export of these attributes, the “Agentic Mesh” will remain a black box, increasing governance and cost control challenges. | ||
|
|
||
| ## Design Objectives | ||
|
|
||
| To address these challenges, the `TelemetryPolicy` proposal targets four core objectives: | ||
|
|
||
| 1. **Standardization**: A single API for Gateway and Mesh to configure Access Logging, Metrics generation, and Tracing propagation. | ||
| 2. **GEP-713 Compliance**: Support `targetRef` attachment to `Gateway` and `Namespace`. The latter covers Mesh use-cases. | ||
| 3. **Agentic Support**: Enable the capture of OpenTelemetry GenAI Semantic Conventions and support the requirements of PR #33. | ||
|
Contributor
In the spirit of standardization and not reinventing the wheel, I wanted to mention that the llm-d community is already moving on tracing + OTel + GenAI Semantic Conventions. In particular, Sally O'Malley from Red Hat proposed and did a POC for distributed tracing in llm-d. [aside: I learned about this work from Sally on another community call for our kagenti project] This may be applicable here for a few reasons:
While llm-d is focused on distributed LLM inferencing regardless of source (i.e., user chat -> LLM vs agent -> LLM), I think it's worth considering any lessons they may have already encountered and API definitions that could overlap with our case, at the very least at the Gateway level. I'd be willing to evangelize our thinking to Sally to get her thoughts, but more importantly curious on our interest level.
Author
It would certainly be valuable to get some of their insights and experiences. The proposal seems to cover configuration through environment variables; have they defined CRDs as well?
Contributor
No, I did not see any CRD definitions. I'll keep this thread in mind as the definitions become more concrete.
||
| 4. **Protocol Agnostic**: Support OpenTelemetry as the primary data model while allowing vendor-specific extensions. | ||
|
|
||
| ## The TelemetryPolicy Specification | ||
|
|
||
| We propose the `TelemetryPolicy` as a direct policy attachment in the `gateway.networking.k8s.io` API group. See [GEP-713](https://gateway-api.sigs.k8s.io/geps/gep-713/#classes-of-policies) for more information on Direct attachment. | ||
|
Contributor
Should it belong to the agentic API group, or maybe start as such while it's proposed in the scope of the subproject?
Author
Makes sense, will update.
||
|
|
||
| ### Resource Structure | ||
|
|
||
| The following is an example that demonstrates the structure of the `TelemetryPolicy`. | ||
|
|
||
| ```yaml | ||
|
Member
Can we also have the status specification, please?
||
| apiVersion: gateway.networking.k8s.io/v1alpha2 | ||
| kind: TelemetryPolicy | ||
| metadata: | ||
| name: standard-telemetry | ||
| namespace: prod-ns | ||
| spec: | ||
| # GEP-713 Attachment | ||
| targetRefs: | ||
| - group: gateway.networking.k8s.io | ||
| kind: Gateway | ||
| name: my-gateway | ||
|
|
||
| # 1. Tracing Configuration | ||
| tracing: | ||
| provider: | ||
| type: OTLP # or implementation-specific | ||
|
Member
Is `type` a specific Go type? It would be useful to define something like:

    type TracingProviderType string
    const (
        // OTLPTracingProvider is used to ....
        OTLPTracingProvider TracingProviderType = "OTLP"
    )

And then an implementation-specific one would probably be a vendor-prefixed thing like
It would probably be useful to have the Go types as part of this PR, like other GEPs in Gateway API.
Author
I've added Go types in the "Detailed resource description" section. I'm not sure that we need to include specific implementation providers as part of the API spec.
I cannot come up with any good reason to support anything but OTLP for tracing...
||
| endpoint: "otel-collector.monitoring.svc:4317" | ||
|
Member
Is this basically any URL?
Author
Yes, in this example it's the URL of the OTLP endpoint.
||
| samplingRate: | ||
|
I think you need to explain how sampling of traces works. Is this respecting the existing "sampling" decisions? Is this for requests without an existing context?
Author
Added a brief explanation. This is the base sampling rate across all requests. The optional
||
| percent: 5 | ||
|
Member
This would benefit from a Go struct. We had loads of discussion on how to do percentages in Gateway API. I could not find the long thread; I think it was in a meeting. This is the best thing I could find: kubernetes-sigs/gateway-api#3178, but maybe @robscott has something more concrete.
Author
Added Go structs in the "Detailed resource description" section.
||
| parentBasedSampling: | ||
| enabled: true | ||
| samplingRate: | ||
| percent: 50 | ||
| context: | ||
| - W3C | ||
| - B3 | ||
|
Member
This will likely benefit from comments.
Author
Added comments in the Go struct spec. Should we add them here as well?
Please stop encouraging the use of ancient tracing headers. Just use OTLP W3C context and ignore everything else.
||
| customAttributes: | ||
| - attributeName: "env" | ||
| literalValue: "production" | ||
|
Member
Nit: what's the rationale behind `literalValue` vs. just `value`?
Author
It is to make it explicit that this is a static value for the attribute. In the future, we could also consider dynamic attributes that are derived at runtime.
||
|
|
||
| # 2. Metrics Configuration | ||
|
Contributor
I think it's questionable how portable dynamic configuration of metrics is. I've only seen Envoy do this, and even there it has historically been extremely bug-ridden. It's very confusing for users what the semantics are, or should be, to add or remove labels from metrics. Or even adding a metric -- imagine I have a dashboard like
||
| metrics: | ||
| enabled: true | ||
| provider: | ||
| type: Prometheus | ||
|
Member
Same comment on Go struct and types.
IMO for metrics, just use OTLP. It's 2026... It's interchangeable with Prometheus, and OTLP is here to stay. I'd drop `provider` to avoid any kind of vendor interference here.
||
| overrides: | ||
| - name: "gateway.networking.k8s.io/http/request_count" | ||
|
Contributor
Are these metric names standardised somehow, or is the alignment between who owns the policy and who consumes the metrics something we expect to happen but that is out of scope for the proposal?
Author
This specific metric name is just an arbitrary example. The closest thing to a standard for metric names would be OTel's semantic conventions. The alignment between policy owner and metric consumer is indeed out-of-scope (but I'm happy to try to include it if needed).
Contributor
Nitpick, but can you use an obviously dummy name to avoid making it seem like it's a real proposal? like
Please do not use official APIs without API approval, and, as John says, use unambiguous domains for examples.
||
| type: Counter | ||
| dimensions: # Custom labels/dimensions | ||
|
Member
This part of the API is slightly confusing and seems somewhat like a Prometheus rewrite rule. Is this something supported by OTEL, or is it an assumption that the Gateway implementation will do the rewrites/overrides here?
||
| - key: "model_id" | ||
| fromHeader: "x-model-id" # Crucial for Agentic workloads | ||
|
Contributor
Any other possible sources in mind? E.g.
Author
Indeed,
||
|
|
||
| # 3. Access Logging | ||
| accessLogs: | ||
| enabled: true | ||
| format: JSON | ||
|
gkhom marked this conversation as resolved.
Outdated
|
||
| matches: # Conditional logging | ||
| - path: "/api/v1/sensitive" | ||
|
gkhom marked this conversation as resolved.
Outdated
|
||
| - cel: "response.code >= 500" # CEL-based filtering for errors | ||
|
Contributor
Are we thinking the same variables described at https://github.com/kubernetes-sigs/kube-agentic-networking/blob/main/docs/proposals/0017-DynamicAuth.md#available-context-variables?
Author
Good question. Those should be included, but we will likely need more. For example, a
CEL is great for the long tail, but it's too slow and obscure as the primary method. It'll add 3-5x overhead to basic matching.
Comment on lines +99 to +100
Contributor
If we only want CEL, it's a bit awkward to have a list. But it may make sense if we have non-CEL
Member
One thing I am missing in this proposal is what is required and what is extended. Given this may become a Gateway API specification, we should at least define what is and is not expected to be a core feature of this API. E.g., the CEL matching here may not be implementable by all shippers.
||
| fields: # Configure specific fields to include | ||
| - "start_time" | ||
| - "response_code" | ||
| - "x-token-usage" | ||
|
Comment on lines +101 to +104
Contributor
Is there any definition of what these fields mean? For a concrete example, say I want to log the MCP task name (I chose this since it's not in https://opentelemetry.io/docs/specs/semconv/gen-ai/mcp/). Can I do it? What do I put as the field if I want to?
||
| ``` | ||
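To make the `dimensions` / `fromHeader` part of the example above more concrete, here is a minimal Go sketch of how a data plane might resolve header-derived dimensions into metric labels. It is not part of the proposal: the function name `DimensionLabels` is hypothetical, the `Dimension` type is copied from the proposal without its JSON tags, and the choice to omit a label when its header is absent is an assumption, since the proposal does not specify that behaviour.

```go
package main

import (
	"fmt"
	"net/http"
)

// Dimension mirrors the Dimension type in the proposal (JSON tags omitted).
type Dimension struct {
	Key        string
	FromHeader string
}

// DimensionLabels resolves each dimension to a label value read from the
// given headers. Assumption (not specified by the proposal): a dimension
// whose header is absent or empty is omitted rather than emitted empty.
func DimensionLabels(h http.Header, dims []Dimension) map[string]string {
	labels := make(map[string]string, len(dims))
	for _, d := range dims {
		if d.FromHeader == "" {
			continue
		}
		// Header.Get canonicalizes the name, so the lookup is case-insensitive.
		if v := h.Get(d.FromHeader); v != "" {
			labels[d.Key] = v
		}
	}
	return labels
}

func main() {
	h := http.Header{}
	h.Set("x-model-id", "example-model")

	dims := []Dimension{{Key: "model_id", FromHeader: "x-model-id"}}
	fmt.Println(DimensionLabels(h, dims))
}
```

Dropping labels for absent headers keeps cardinality down, but an implementation could equally emit a placeholder value; the proposal would need to say which.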
|
|
||
| ### Policy Attachment | ||
|
|
||
| Following [GEP-713](https://gateway-api.sigs.k8s.io/geps/gep-713/), the `TelemetryPolicy` supports the following attachments: | ||
|
|
||
| 1. **Gateway (Instance Scope)**: Configures the telemetry for a specific `Gateway`. | ||
| 2. **Namespace (Mesh Scope)**: Configures the telemetry for all mesh proxies (sidecar proxy / node proxy / etc.) in that namespace. | ||
|
Member
What does Namespace target? Are Gateways excluded from targeting (note that some Gateways are in the cluster and some are outside the cluster)? It would be good to have some user journeys that we want to allow here with the configuration. FWIW, we have gone through a similar exercise in Gateway API for AuthZ, and here is the conclusion that landed (note that this is still experimental, though). Here it is for more context -- https://github.com/kubernetes-sigs/gateway-api/pull/3891/changes#diff-6886a6f78647100500384beb636df7b6487717be6d9f8366f50d8a0bd3581927R196-R238 @guicassolato has tons of experience in this as well, as it was initially proposed as part of GEP-713.
Author
My current thinking is that targeting a namespace implies a "mesh" target, i.e., it would include all proxies in the namespace except gateways. I think it's reasonable to assume that different observability configurations may be desired/applicable to Gateway use-cases (north/south) compared to Mesh use-cases (east/west). That's why I think it's better to avoid making namespace a target that captures both.
Contributor
Defining what namespace targeting means, even if that definition points to the mesh use case only, is fine. But it has to be more than implied IMO. It has to be by design and well specified/documented, so all implementations of the API will commit to the same meaning and behaviour. (I believe that's what @gkhom has in mind, but good to spell it out, I think.) On a side note, if namespace targeting is for the mesh use case, have you considered the
Contributor
Although implied when calling it "direct policy", I think it may be useful to add a note about the merge semantics, which I imagine will be the
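The two attachment scopes discussed in this thread can be sketched in Go as a single predicate. This is only an illustration under assumptions taken from the review discussion rather than from the proposal text: namespace targeting covers mesh proxies and excludes gateways (the author's stated current thinking), and a policy may only target resources in its own namespace (the reviewers' preference). The `Policy` and `Proxy` types and the `Applies` function are hypothetical, trimmed-down stand-ins.

```go
package main

import "fmt"

// TargetKind is the kind of resource a policy attaches to.
type TargetKind string

const (
	TargetGateway   TargetKind = "Gateway"
	TargetNamespace TargetKind = "Namespace"
)

// Policy is a hypothetical, trimmed-down view of a TelemetryPolicy with a
// single target reference.
type Policy struct {
	Name       string
	Namespace  string
	TargetKind TargetKind
	TargetName string
}

// Proxy is a hypothetical description of a data plane proxy.
// GatewayName is empty for mesh proxies (sidecar, node proxy, and so on).
type Proxy struct {
	Namespace   string
	GatewayName string
}

// Applies reports whether policy p configures telemetry for proxy x.
// Assumptions: same-namespace attachment only, and Namespace targets
// select mesh proxies but not gateways.
func Applies(p Policy, x Proxy) bool {
	switch p.TargetKind {
	case TargetGateway:
		return x.GatewayName != "" &&
			x.GatewayName == p.TargetName &&
			x.Namespace == p.Namespace
	case TargetNamespace:
		return x.GatewayName == "" &&
			p.TargetName == p.Namespace &&
			x.Namespace == p.TargetName
	default:
		return false
	}
}

func main() {
	gwPolicy := Policy{Name: "standard-telemetry", Namespace: "prod-ns", TargetKind: TargetGateway, TargetName: "my-gateway"}
	nsPolicy := Policy{Name: "mesh-telemetry", Namespace: "prod-ns", TargetKind: TargetNamespace, TargetName: "prod-ns"}

	gateway := Proxy{Namespace: "prod-ns", GatewayName: "my-gateway"}
	sidecar := Proxy{Namespace: "prod-ns"}

	fmt.Println(Applies(gwPolicy, gateway), Applies(gwPolicy, sidecar))
	fmt.Println(Applies(nsPolicy, gateway), Applies(nsPolicy, sidecar))
}
```

Whatever rule is finally chosen, writing it down as a predicate like this makes it easier to check that every implementation commits to the same meaning.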
||
|
|
||
| #### Alternatives Considered | ||
|
|
||
| ##### GatewayClass | ||
|
|
||
| Targeting `GatewayClass` would set the default telemetry configurations for all Gateways of a specific class. While this would provide a powerful mechanism, the challenge is that `GatewayClass` is a cluster-scoped entity whereas `TelemetryPolicy` is namespace-scoped. Allowing a namespace-scoped resource to influence the behavior of an entire cluster introduces significant operational and security risks. We would also need to define the semantics in the presence of multiple `TelemetryPolicy` resources that target the same `GatewayClass`. This is out of scope for this proposal. | ||
|
|
||
| ##### Route | ||
|
|
||
| Future iterations could support attachment directly to routes (e.g., `HTTPRoute`). This would allow specific telemetry configuration for critical paths or specific API endpoints. To keep the initial API simple, this is deferred to a future proposal. | ||
|
|
||
| ##### Workload | ||
|
|
||
| We evaluated the ability to target specific workloads directly using pod label selectors. This would allow for precise application of telemetry settings to specific groups of pods (e.g., forcing debug logging on a specific deployment). However, we are prioritizing namespace-level attachment for mesh use-cases to align with existing Gateway API patterns. | ||
|
|
||
| ##### Service | ||
|
|
||
| Attachment to a `Service` is deferred because a `Service` resource primarily defines the "exposure" or inbound side of a workload. It is not intuitive for a policy attached to an inbound definition to configure telemetry for both inbound and outbound traffic. Additionally, since multiple Services can select the same Pod, resolving precedence or merging strategies when different `TelemetryPolicy` resources target those different Services introduces significant complexity. | ||
|
|
||
| ### Detailed Resource Description | ||
|
|
||
| The following are the Go structs modeling the proposed specification: | ||
|
|
||
| ```Go | ||
| // TelemetryPolicy defines a direct policy attachment to configure observability | ||
| // signals for Gateway API resources and Service Mesh resources. | ||
| type TelemetryPolicy struct { | ||
| metav1.TypeMeta `json:",inline"` | ||
| metav1.ObjectMeta `json:"metadata,omitempty"` | ||
|
|
||
| Spec TelemetryPolicySpec `json:"spec"` | ||
| } | ||
|
Contributor
Should probably define a
Author
Thanks, I've added the status stanza.
||
|
|
||
| type TelemetryPolicySpec struct { | ||
| // Identifies the target resources (Gateway or Namespace) to which this policy attaches (GEP-713). | ||
| TargetRefs []TargetReference `json:"targetRefs"` | ||
|
Contributor
We could probably reuse here one of the Gateway API types from https://github.com/kubernetes-sigs/gateway-api/blob/main/apis/v1/policy_types.go. I wonder, for example, if namespace targeting should be allowed only in the same namespace (
Author
I'm leaning towards
Contributor
Allowing cross-namespace targeting entirely violates namespace boundaries. Namespace X should definitely not be able to modify namespace Y's configuration. If we want uniform management, then IMO the correct way to do this would be to modify a global object, probably
This hasn't been done in the Gateway API space, so it would be novel.
Member
I think we should definitely not start with allowing cross-ns. Regarding if and how to allow it, I also usually don't like this idea of controlling telemetry endpoints from another namespace. We could either adopt
||
|
|
||
| // Configuration for distributed tracing options. | ||
| Tracing *TracingConfig `json:"tracing,omitempty"` | ||
|
|
||
| // Configuration for metric generation and exports. | ||
| Metrics *MetricsConfig `json:"metrics,omitempty"` | ||
|
|
||
| // Configuration for access log generation. | ||
| AccessLogs *AccessLogsConfig `json:"accessLogs,omitempty"` | ||
| } | ||
|
|
||
| // --- Tracing Types --- | ||
|
|
||
| type TracingConfig struct { | ||
| // Specifies the tracing backend. Includes type (e.g., "OTLP") and endpoint. | ||
| Provider *TracingProvider `json:"provider,omitempty"` | ||
|
|
||
| // The base sampling probability for traces. | ||
| SamplingRate *Fraction `json:"samplingRate,omitempty"` | ||
|
|
||
| // Configures whether to respect the sampling decision of the parent span. | ||
| ParentBasedSampling *ParentBasedSampling `json:"parentBasedSampling,omitempty"` | ||
|
|
||
| // Specifies the context propagation formats to use (e.g., W3C, B3, Jaeger). | ||
| Context []string `json:"context,omitempty"` | ||
|
|
||
| // Allows appending custom tags/attributes to spans. | ||
| CustomAttributes []CustomAttribute `json:"customAttributes,omitempty"` | ||
| } | ||
|
|
||
| type TracingProvider struct { | ||
| Type string `json:"type"` | ||
|
Contributor
Would this be an enum or completely free for the implementations to define? Maybe some
||
| Endpoint string `json:"endpoint,omitempty"` | ||
| } | ||
|
|
||
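| // Fraction expresses a ratio as a whole-number percentage. | ||
| type Fraction struct { | ||
| // Percent is the percentage value, e.g. 5 for 5%. | ||
| Percent int32 `json:"percent,omitempty"` | ||
| } | ||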
|
|
||
| type ParentBasedSampling struct { | ||
| Enabled bool `json:"enabled"` | ||
| SamplingRate *Fraction `json:"samplingRate,omitempty"` | ||
| } | ||
|
|
||
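| // CustomAttribute is a static key/value pair appended to spans. | ||
| type CustomAttribute struct { | ||
| // AttributeName is the name of the span attribute. | ||
| AttributeName string `json:"attributeName"` | ||
| // LiteralValue is the static value assigned to the attribute. | ||
| LiteralValue string `json:"literalValue"` | ||
| } | ||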
|
|
||
| // --- Metrics Types --- | ||
|
|
||
| type MetricsConfig struct { | ||
| // Global switch to enable or disable metric generation. | ||
| Enabled bool `json:"enabled"` | ||
|
|
||
| // Specifies the metrics backend (e.g., Prometheus). | ||
| Provider *MetricsProvider `json:"provider,omitempty"` | ||
|
|
||
| // List of configurations to customize specific metric families. | ||
| Overrides []MetricOverride `json:"overrides,omitempty"` | ||
| } | ||
|
|
||
| type MetricsProvider struct { | ||
| Type string `json:"type"` | ||
| } | ||
|
|
||
| type MetricOverride struct { | ||
| // The metric name to override (e.g., "http_requests_total" or "gateway.networking.k8s.io/http/request_count"). | ||
| Name string `json:"name"` | ||
| Type string `json:"type,omitempty"` | ||
| // Defines custom dimensions (labels). Can extract values from headers. | ||
| Dimensions []Dimension `json:"dimensions,omitempty"` | ||
| } | ||
|
|
||
| type Dimension struct { | ||
| Key string `json:"key"` | ||
| FromHeader string `json:"fromHeader,omitempty"` | ||
| } | ||
|
Comment on lines +220 to +223
Contributor
The 3 APIs all have the same property of "add a K/V pair" but do so in 3 different ways. Does that make sense? Should we be more consistent across them? It seems odd that:
|
||
|
|
||
| // --- Access Logs Types --- | ||
|
|
||
| type AccessLogsConfig struct { | ||
| // Global switch to enable or disable access logging. | ||
| Enabled bool `json:"enabled"` | ||
|
|
||
| // The format of the logs (e.g., JSON, Text). | ||
| Format string `json:"format,omitempty"` | ||
|
|
||
| // Conditions for logging, allowing filtering to specific paths or events. | ||
| Matches []MatchCondition `json:"matches,omitempty"` | ||
|
|
||
| // A list of specific fields or headers to include in the logs. | ||
| Fields []string `json:"fields,omitempty"` | ||
| } | ||
|
|
||
| type MatchCondition struct { | ||
| // Path allows filtering to specific paths. | ||
| Path string `json:"path,omitempty"` | ||
|
Contributor
Why bother having this if we already have CEL and it's trivial to express in CEL?
Member
The way I understand this is that "Path" is a (convenient?) simpler shortcut for just logging vs expressing this in CEL. Probably if we were to leave it we should do it as a But it makes sense to remove and start with an API without it.

CEL is an order of magnitude slower than native code... Even if you use CEL you'd have to add a bunch of variants (query stripped, normalized, escaped, raw path) which will make it more confusing than having a structured condition.
Contributor
Maybe in some implementations. Should a general-purpose, vendor-agnostic API switch to a suboptimal user experience (let's assume here that CEL is the preferred UX, since if it's a bad UX and bad performance it would obviously be a bad choice) because some implementations do not implement it optimally? We are handling these at ~native speeds in our implementation.
This is exactly why I would prefer CEL personally, so we don't have to make our YAML API have 5 variations just for path -- not to mention headers, query params, and cookies; we will have like 25 fields just to match headers.

CEL is an interpreted language by design, so unless you're carving out a specific subset (or breaking some semantics), it cannot match native speeds. In either of those cases, that is not vendor neutral: there's no guarantee that the "fast" subset of CEL is the same across implementations. YAML gives that for free. Every implementation will be similarly efficient, and all the "meta" tooling will support it without having to compile/run CEL.
Contributor
FWIW CEL is used throughout the other APIs in this repo already. IMO this project needs to decide on CEL or not, so we don't have to have this debate on each field. It's not good for users to have 50% of fields (that would make sense to use CEL) use CEL and 50% not, just because of who made the API, and it's not great for us to have to debate it on each field usage.

Sure, as someone who worked on the CEL C++ runtime and made it open source, I'm telling you CEL is not the right paradigm for the "95%" of data path cases. It's too slow, and it prevents meta-tooling (management and control planes) from analyzing the semantics. It makes perfect sense as an "extensible context-aware condition" (e.g. like Wasm) for the remaining 5% of tail-end bespoke cases, but you wouldn't use CEL for label selectors, would you?
|
||
|
|
||
| // CEL provides an expression for advanced filtering (e.g., matching response codes, headers). | ||
| CEL string `json:"cel,omitempty"` | ||
| } | ||
| ``` | ||
|
|
||
| ### Alignment with Requirements | ||
|
|
||
| #### Agentic Telemetry | ||
|
|
||
| * **Token Counting**: The `metrics.overrides` and `accessLogs.fields` sections allow extracting values from headers (e.g., `x-usage-input-tokens`, `x-usage-output-tokens`) or from request/response bodies (if supported by the data plane) into telemetry. | ||
| * **Tool Use Auditing**: By attaching a `TelemetryPolicy` to a `Gateway` serving LLM traffic, operators can enforce 100% access logging for specific routes (e.g., `/tool/execute`) to create an immutable audit trail of agent actions. | ||
| * **Latency Tracking**: Latency histograms can be configured to track "Time to First Token" (TTFT) if exposed by the backend protocol. | ||
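
The token-counting and tool-auditing bullets above could be expressed with the proposed Go types roughly as follows. This is a minimal sketch, not the proposal's definitive usage: the types are a subset copied from the API definition, while the dimension keys (`input_tokens`, `output_tokens`), the use of `http_requests_total` as the overridden metric, the `JSON` format string, and the convention that `fields` entries are plain header names are all assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Subset of the types from the proposal's Go definitions.
type Dimension struct {
	Key        string `json:"key"`
	FromHeader string `json:"fromHeader,omitempty"`
}

type MetricOverride struct {
	Name       string      `json:"name"`
	Type       string      `json:"type,omitempty"`
	Dimensions []Dimension `json:"dimensions,omitempty"`
}

type MetricsConfig struct {
	Enabled   bool             `json:"enabled"`
	Overrides []MetricOverride `json:"overrides,omitempty"`
}

type MatchCondition struct {
	Path string `json:"path,omitempty"`
	CEL  string `json:"cel,omitempty"`
}

type AccessLogsConfig struct {
	Enabled bool             `json:"enabled"`
	Format  string           `json:"format,omitempty"`
	Matches []MatchCondition `json:"matches,omitempty"`
	Fields  []string         `json:"fields,omitempty"`
}

// tokenMetrics surfaces the token-usage headers as metric dimensions.
// The dimension keys are hypothetical names chosen for this example.
func tokenMetrics() MetricsConfig {
	return MetricsConfig{
		Enabled: true,
		Overrides: []MetricOverride{{
			Name: "http_requests_total",
			Dimensions: []Dimension{
				{Key: "input_tokens", FromHeader: "x-usage-input-tokens"},
				{Key: "output_tokens", FromHeader: "x-usage-output-tokens"},
			},
		}},
	}
}

// toolAuditLogs logs every request to the tool-execution path.
// Listing header names directly in Fields is an assumed convention.
func toolAuditLogs() AccessLogsConfig {
	return AccessLogsConfig{
		Enabled: true,
		Format:  "JSON",
		Matches: []MatchCondition{{Path: "/tool/execute"}},
		Fields:  []string{"x-usage-input-tokens", "x-usage-output-tokens"},
	}
}

func main() {
	out, err := json.MarshalIndent(map[string]any{
		"metrics":    tokenMetrics(),
		"accessLogs": toolAuditLogs(),
	}, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Raw token counts used as dimension values would produce high-cardinality series, so an implementation may prefer to carry them in access logs only; the sketch shows both options without recommending either.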
|
|
||
| #### Tracing | ||
|
|
||
| * **Sampling**: Supports probabilistic and parent-based sampling. | ||
| * **Propagation**: Explicitly configures propagation formats (W3C TraceContext by default, with B3, Jaeger, etc. as options). | ||
| * **Customization**: Allows appending custom tags/attributes to spans. | ||
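
One way a data plane might combine the two sampling modes is sketched below. The proposal does not define the exact semantics, so the following are assumptions: when `ParentBasedSampling.Enabled` is true and a parent decision exists, the parent decision wins; otherwise `SamplingRate.Percent` is read as a value from 0 to 100 and applied deterministically by hashing the trace ID; a missing `SamplingRate` means "do not sample". The `shouldSample` function and its parameters are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Types copied from the proposal's Go definitions.
type Fraction struct {
	Percent int32 `json:"percent,omitempty"`
}

type ParentBasedSampling struct {
	Enabled      bool      `json:"enabled"`
	SamplingRate *Fraction `json:"samplingRate,omitempty"`
}

// shouldSample decides whether to record a span.
// parentSampled is nil when the request carries no parent trace context.
func shouldSample(cfg ParentBasedSampling, traceID string, parentSampled *bool) bool {
	// Assumed semantics: honor the parent's decision when parent-based
	// sampling is enabled and a parent decision is present.
	if cfg.Enabled && parentSampled != nil {
		return *parentSampled
	}
	// Assumed semantics: no rate configured means nothing is sampled.
	if cfg.SamplingRate == nil {
		return false
	}
	// Probabilistic fallback: hash the trace ID into a bucket in [0, 100)
	// so that every hop makes the same decision for the same trace.
	h := fnv.New32a()
	h.Write([]byte(traceID))
	bucket := int32(h.Sum32() % 100)
	return bucket < cfg.SamplingRate.Percent
}

func main() {
	cfg := ParentBasedSampling{Enabled: true, SamplingRate: &Fraction{Percent: 10}}
	sampled := true
	fmt.Println("with sampled parent:", shouldSample(cfg, "trace-1", &sampled))
	fmt.Println("root span:", shouldSample(cfg, "trace-1", nil))
}
```

Hashing the trace ID rather than drawing a random number keeps the decision consistent across proxies that see the same trace, which is the usual motivation for ratio-based samplers.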
|
|
||
| #### Metrics | ||
|
|
||
| * **Granularity**: Users can enable/disable specific metric families. | ||
| * **Dimensions**: The API supports "overrides" (similar to [OpenTelemetry Views](https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view)) where users can add or remove dimensions (labels/attributes) to reduce cardinality or increase visibility. | ||
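
To make the "add dimensions" half of the overrides concrete, the sketch below shows how an implementation might turn a list of `Dimension` entries into metric labels for a single request. It is an illustration under assumptions: `resolveLabels` is a hypothetical helper, dimensions whose header is absent are skipped rather than emitted with an empty value, and the `x-tenant` header is invented for the example. Removing dimensions is not covered.

```go
package main

import (
	"fmt"
	"net/http"
)

// Dimension is copied from the proposal's Go definitions.
type Dimension struct {
	Key        string `json:"key"`
	FromHeader string `json:"fromHeader,omitempty"`
}

// resolveLabels builds the label set for one request.
// get looks up a header value by name, for example http.Header.Get.
func resolveLabels(dims []Dimension, get func(name string) string) map[string]string {
	labels := make(map[string]string, len(dims))
	for _, d := range dims {
		if d.FromHeader == "" {
			continue // no value source in this sketch
		}
		// Assumption: omit the label when the header is missing,
		// instead of emitting an empty value.
		if v := get(d.FromHeader); v != "" {
			labels[d.Key] = v
		}
	}
	return labels
}

func main() {
	h := http.Header{}
	h.Set("x-usage-input-tokens", "128")

	dims := []Dimension{
		{Key: "input_tokens", FromHeader: "x-usage-input-tokens"},
		{Key: "tenant", FromHeader: "x-tenant"}, // header not present on this request
	}
	fmt.Println(resolveLabels(dims, h.Get))
}
```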
|
|
||
| #### Logging | ||
|
|
||
| * **Flexible Formatting**: Supports both JSON and text formats for compatibility with standard log aggregation stacks. | ||
| * **Smart Filtering**: Reduces noise and cost via CEL-based filtering, allowing logs to be generated only for specific events (e.g., 5xx errors, high latency, or critical paths). | ||
|
gkhom marked this conversation as resolved.
|
||
| * **Custom Attributes**: Enables the extraction of specific headers and proxy metadata into log entries. | ||
| * **Sinks**: Defaults to standard container logging (stdout) with extensibility for OTLP or external ports. | ||
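
The filtering behavior could be evaluated along the following lines. The proposal leaves the matching semantics open (and the review thread debates them), so this sketch assumes: an empty `matches` list logs everything, conditions are OR-ed, `Path` and `CEL` within one condition are AND-ed, and `Path` is an exact match. CEL evaluation is delegated to a hypothetical `celEval` hook; a real implementation would plug in a CEL runtime, and the expression string used in `main` is illustrative only and is never parsed.

```go
package main

import "fmt"

// MatchCondition is copied from the proposal's Go definitions.
type MatchCondition struct {
	Path string `json:"path,omitempty"`
	CEL  string `json:"cel,omitempty"`
}

// celEval is a hypothetical hook that evaluates a CEL expression
// against the current request/response and reports whether it holds.
type celEval func(expr string) bool

// shouldLog reports whether an access log entry should be emitted.
func shouldLog(matches []MatchCondition, path string, eval celEval) bool {
	if len(matches) == 0 {
		return true // assumption: no conditions means log everything
	}
	for _, m := range matches {
		if m.Path != "" && m.Path != path {
			continue // assumption: exact path match
		}
		if m.CEL != "" && !eval(m.CEL) {
			continue
		}
		return true // every populated field of this condition matched
	}
	return false
}

func main() {
	matches := []MatchCondition{
		{Path: "/tool/execute"},
		{CEL: "response.code >= 500"}, // illustrative expression, not evaluated here
	}
	// Stand-in evaluator that treats every expression as false.
	never := func(string) bool { return false }

	fmt.Println(shouldLog(matches, "/tool/execute", never))
	fmt.Println(shouldLog(matches, "/healthz", never))
}
```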
|
|
||
| ## Comparison with Prior Art | ||
|
Contributor
N.B. all of these are wrappers around Envoy.
Contributor
Good point.
Member
@howardjohn - can you offer other prior art that does not rely on Envoy? I recall seeing a comment here or in Slack welcoming help/contributions to the prior art section.
Contributor
|
||
|
|
||
| ### Istio | ||
|
|
||
| [Istio](https://istio.io/)'s `Telemetry` API is the most direct prior art that inspired this proposal. It allows configuring observability at the mesh, namespace, and workload level. | ||
|
|
||
| * **Metrics**: Istio allows users to enable/disable specific metrics, add custom dimensions, and configure providers. | ||
| * **Logs**: Istio supports access logging configurations with CEL-like expressions for advanced filtering. | ||
| * **Traces**: Istio supports probabilistic sampling, context propagation, and custom span tags. | ||
| * **Customization**: For advanced telemetry use-cases not natively covered by the `Telemetry` API, Istio users can fall back to using `EnvoyFilter` resources. While highly flexible, `EnvoyFilter` requires deep knowledge of Envoy's internal xDS API, is tightly coupled to the data plane implementation, and can be brittle across version upgrades. | ||
| * **Comparison**: The proposed `TelemetryPolicy` adapts Istio's powerful intent-based capabilities to the standardized Gateway API attachment model. | ||
|
|
||
| ### Envoy Gateway | ||
|
|
||
| [Envoy Gateway](https://gateway.envoyproxy.io/) configures observability through two distinct custom resources: `EnvoyGateway` for the control plane and `EnvoyProxy` for the underlying data plane proxies. | ||
|
|
||
| * **Metrics**: Envoy Gateway allows configuring Prometheus and OpenTelemetry sinks for both the control plane (using the `EnvoyGateway` CRD) and the data plane proxies (using the `EnvoyProxy` CRD). | ||
| * **Logs**: Proxy access logs are configured via the `EnvoyProxy` resource. It supports exporting to file, OTLP, or gRPC Access Log Service (ALS) sinks. It uses CEL expressions for smart filtering (e.g., matching specific headers), and allows applying log configurations at the Route or Listener level. | ||
| * **Tracing**: Tracing is configured in the `EnvoyProxy` resource. It supports OpenTelemetry, Zipkin, and Datadog providers. It allows configuring sampling and supports appending custom tags derived from literals, environment variables, or request headers. | ||
| * **Customization**: For advanced telemetry use-cases not covered natively, users can fall back to the `EnvoyPatchPolicy` API to mutate the underlying xDS configuration using JSON Patch semantics. This is similar to Istio's `EnvoyFilter`. | ||
| * **Comparison**: While Envoy Gateway provides a robust, native telemetry configuration, it is tightly coupled to infrastructure-oriented CRDs. The proposed `TelemetryPolicy` allows users to configure telemetry behaviors using a portable `targetRef` model, without binding their observability intent to an Envoy-specific schema. | ||
|
|
||
| ### Kuadrant | ||
|
|
||
| [Kuadrant](https://kuadrant.io/) provides observability for API management features like rate limiting and authentication. It is configured through a mix of its own custom resources and the underlying gateway's APIs. | ||
|
|
||
| * **Metrics**: Kuadrant enables metrics via the `Kuadrant` CR. It also introduces its own `TelemetryPolicy` API (`extensions.kuadrant.io/v1alpha1`) to add custom dimensions to metrics. | ||
| * **Logs**: For proxy access logging, Kuadrant relies on the underlying gateway provider (e.g., Istio's Telemetry API). However, it configures request correlation across its own components (Authorino, Limitador, and Wasm-shim) by specifying HTTP header identifiers in the `Kuadrant` CR. | ||
| * **Tracing**: Tracing is configured centrally via the `Kuadrant` CR. It exports OpenTelemetry spans for both the control plane and data plane components. It supports global trace filtering levels to control the verbosity of exported spans. | ||
| * **Customization**: To make low-level, custom modifications to the data plane configuration that are not supported by Kuadrant's native APIs, users can bypass Kuadrant and directly use the underlying gateway's mechanisms. | ||
| * **Comparison**: While Kuadrant provides powerful, identity-aware telemetry (like token tracking per user), its configuration is fragmented across the `Kuadrant` CR, component-specific CRDs, its custom extension `TelemetryPolicy`, and the underlying gateway's native APIs. The proposed `TelemetryPolicy` unifies these intent-based capabilities into a single, provider-agnostic resource. | ||
|
|
||
I think it's OK to call it a "Direct Policy" while `Gateway` and `Namespace` as supported target kinds are for two completely disjoint use cases – ingress and mesh. By definition, it's only direct when:

Would it be more accurate to call it "inherited policy"?