TelemetryPolicy proposal #69
This change includes context, a problem description, and design objectives for a TelemetryPolicy proposal. If the community agrees on this context, I will follow up with the actual API specification.
[WIP] TelemetryPolicy proposal
@gkhom, can you fix this?
|
/ok-to-test |
|
/easycla |
|
/check-cla |
@gkhom Thanks for kicking this off. I left some clarifying questions and proposed a minor change. The bigger part of the review is to surface work already being proposed or done in the llm-d community on tracing, and to ask whether it applies to our objectives.
| This proposal introduces the `TelemetryPolicy`, a direct policy attachment designed to configure observability signals (metrics, logs, traces) | ||
| for Gateway API resources (via `Gateway` attachment) and Service Mesh resources (via `namespace` attachment). | ||
|
|
||
| This K8s API standardizes how users enable and configure telemetry across different data plane implementations, replacing vendor-specific CRDs |
I am (acting as) a naive reader, and I was immediately curious what some examples of these vendor-specific CRDs are. This also ties to and might clarify the below mention of "Observability lock-in".
There was a problem hiding this comment.
Examples of such CRDs are:
- Istio's Telemetry CRD
- Envoy Gateway's EnvoyProxy and EnvoyGateway CRDs
- Kong's MeshMetrics/MeshTrace/MeshAccessLog
- Kuadrant's TelemetryPolicy
I intend to write a section that compares such existing APIs and the proposed TelemetryPolicy in the eventual proposal.
There was a problem hiding this comment.
Seeing as there's a mix of examples here, will the scope cover one resource for all of the signals (metrics, logs, traces) vs. separate ones? Are there tradeoffs to consider here?
There was a problem hiding this comment.
I'm leaning towards one resource for all. The argument for splitting them might be that different personas configure the different aspects of observability. In practice, I think the persona that configures metrics likely also configures tracing and access logs. So, to avoid complicating the API with three additional resources, it seems worthwhile to put all of it in a single resource.
| * **Cost is Volatile**: Usage is measured in tokens, not just requests. A single HTTP 200 OK could cost $0.01 or $10.00 depending on the prompt and model used. | ||
| * **Context is King**: Debugging requires knowing the semantic context: Which Model? Which Prompt? Which tool? | ||
|
|
||
| Existing telemetry policies are unaware of the Generative AI semantic conventions. They see an opaque TCP stream or HTTP POST. Without a standardized API to |
In line with the header, may I suggest adding "unaware of the emerging Generative AI semantic conventions"?
|
|
||
| 1. **Standardization**: A single API for Gateway and Mesh to configure Access Logging, Metrics generation, and Tracing propagation. | ||
| 2. **GEP-713 Compliance**: Support `targetRef` attachment to `Gateway` and `Namespace`. The latter covers Mesh use-cases. | ||
| 3. **Agentic Support**: Enable the capture of OpenTelemetry GenAI Semantic Conventions and support the requirements of PR #33. |
In the spirit of standardization and not reinventing the wheel, I wanted to mention that the llm-d community is already moving on tracing + OTel + GenAI Semantic Conventions. In particular, Sally O'Malley from Red Hat proposed and did a POC for distributed tracing in llm-d. [aside: I learned about this work from Sally on another community call for our kagenti project]
This may be applicable here for a few reasons:
- We are keen on integrating OTel and GenAI semantic conventions, too
- One of our objectives is a single API for Gateways and Meshes, and Sally's POC has already landed some changes that add tracing support to Gateway API Inference Extension (GAIE) components such as the endpoint pickers (proposal comment, GAIE PR).
While llm-d is focused on distributed LLM inferencing regardless of source (i.e., user chat -> LLM vs agent -> LLM), I think it's worth considering any lessons they have already learned and any API definitions that could overlap with our case, at the very least at the Gateway level. I'd be willing to evangelize our thinking to Sally to get her thoughts, but more importantly I'm curious about our interest level.
There was a problem hiding this comment.
It would certainly be valuable to get some of their insights and experiences. The proposal seems to cover configuration through environment variables; have they defined CRDs as well?
There was a problem hiding this comment.
No, I did not see any CRD definitions. I'll keep this thread in mind as the definitions become more concrete.
| for Gateway API resources (via `Gateway` attachment) and Service Mesh resources (via `namespace` attachment). | ||
|
|
||
| This K8s API standardizes how users enable and configure telemetry across different data plane implementations, replacing vendor-specific CRDs | ||
| with a unified, portable spec. |
Would you see implementations reconciling the TelemetryPolicy and reading the bits that are relevant to their components? So multiple controllers read the CR and take actions to enable telemetry across the components they are controlling?
There was a problem hiding this comment.
It is indeed possible to distribute the responsibility across multiple controllers; it's up to the implementation. In most cases that I'm familiar with, a single controller/control plane programs all three observability features (metrics, traces, logs).
There was a problem hiding this comment.
Possible, but a little challenging. In what cases do we see multiple implementations reconcile the same thing?
|
I'm generally in favour of this proposal. Perhaps more relevant when it comes to the specification, it would be good to know more about the current 'state of the art' in this space. |
| matches: # Conditional logging | ||
| - cel: "response.code >= 500" # CEL-based filtering for errors |
If we only want CEL, it's a bit awkward to have a list. But it may make sense if we also have non-CEL matchers.
| fields: # Configure specific fields to include | ||
| - "start_time" | ||
| - "response_code" | ||
| - "x-token-usage" |
Is there any definition of what these fields mean?
For a concrete example, say I want to log the MCP task name (I chose this since it's not in https://opentelemetry.io/docs/specs/semconv/gen-ai/mcp/). Can I do it? If so, what do I put as the field?
| type Dimension struct { | ||
| Key string `json:"key"` | ||
| FromHeader string `json:"fromHeader,omitempty"` | ||
| } |
The 3 APIs all have the same property of "add a K/V pair" but do so in 3 different ways. Does that make sense? Should they be more consistent?
It seems odd that:
- tracing: literal only
- metrics: header only
- log: a name only without a value
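One way to picture a more consistent shape is a single key/value type shared by tracing, metrics, and access logs. The following is only a sketch: `Key` and `FromHeader` mirror the `Dimension` struct quoted above, while the `Attribute` type, its `Literal` field, and the `Resolve` helper are hypothetical names that do not appear in the proposal. Header names are assumed to be already lower-cased by the caller.

```go
package main

import "fmt"

// Attribute is a hypothetical unified key/value type that tracing,
// metrics, and access logs could all share. It is not part of the proposal.
type Attribute struct {
	// Key is the attribute, label, or log field name.
	Key string `json:"key"`
	// Literal is a fixed value (the form tracing uses today).
	Literal string `json:"literal,omitempty"`
	// FromHeader names a request header to read the value from
	// (the form metrics dimensions use today).
	FromHeader string `json:"fromHeader,omitempty"`
}

// Resolve returns the value for the attribute. A literal wins over a
// header lookup; ok is false when no value source yields a value.
func (a Attribute) Resolve(headers map[string]string) (value string, ok bool) {
	if a.Literal != "" {
		return a.Literal, true
	}
	if a.FromHeader != "" {
		v, found := headers[a.FromHeader]
		return v, found
	}
	return "", false
}

func main() {
	// Assumed example data: header names and values are made up.
	headers := map[string]string{"x-model-name": "example-model"}
	attrs := []Attribute{
		{Key: "environment", Literal: "prod"},
		{Key: "model", FromHeader: "x-model-name"},
	}
	for _, a := range attrs {
		if v, ok := a.Resolve(headers); ok {
			fmt.Printf("%s=%s\n", a.Key, v)
		}
	}
}
```

With a shape like this, each signal would accept the same list of attributes, and a new value source could be added in one place instead of three.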
|
Can we move this discussion to Gateway API as a provisional GEP first, and then experimental? A lot of the discussion here is happening in a context that is important not only for agentic, but for the whole Gateway API ecosystem, and I wouldn't like to receive this proposal as "we discussed and approved on agentic and now this needs to be implemented this way on Gateway API". Thanks!
|
Seconding @rikatz's comment – this seems applicable to much more than just the agentic world, and I'd love to get eyes on it from Gateway API. Thanks!! 🙂 |
|
@rikatz @kflynn have you had a chance to also review the API? There has been some really good iteration here. That it probably belongs in Gateway is not in question and has been discussed multiple times. Can you both review here, so that we have all the comments cohesively in one place and can move it to Gateway for the last round? Alternatively, if either of you knows a way to move the proposal to Gateway with all the comments and history, that would also be great. I would like to make sure agentic use cases are represented. For this group they are the main motivation, and it goes beyond "telemetry is useful to standardize for generic cases as well": for agentic it's an increasingly important feature, and we need to make sure we iterate fast here and that the agentic cases are meaningfully represented.
|
No, I didn't, mostly because I didn't know it existed and only heard about it recently at EGADS. I will, but again, I will be strongly against this getting merged to Gateway API if it is merged to this repo first. I think it is a different audience, and the initial part of the GEP already says it is in the wrong place. Having all of these discussions here means they will be lost from the official GEP process.
| 2. Support emerging workloads like AI Agents, which require specialized metrics (e.g., token usage, model latency) and detailed audit logs for tool-use verification. | ||
| 3. Manage “Mesh” and “Gateway” observability with a single unified API. | ||
|
|
||
| ## The Emergence of Agentic Networking |
Do we really need a callout for this here, given that the rest of the document just specifies a need for Telemetry API standardization, regardless of whether it is Agentic Networking or not?
|
|
||
| ## The TelemetryPolicy Specification | ||
|
|
||
| We propose the `TelemetryPolicy` as a direct policy attachment in the `agentic.networking.k8s.io` API group. See [GEP-713](https://gateway-api.sigs.k8s.io/geps/gep-713/#classes-of-policies) for more information on Direct attachment. |
I think this API is too generic to belong to this specific group, maybe start with x-k8s.io?
| accessLogs: | ||
| enabled: true | ||
| matches: # Conditional logging | ||
| - cel: "response.code >= 500" # CEL-based filtering for errors |
One thing I am missing in this proposal is what is required and what is extended. Given that this may become a Gateway API specification, we should at least define what is expected to be a core feature of this API and what is not.
E.g., the CEL matching here may not be implementable by all shippers.
|
|
||
| ## The TelemetryPolicy Specification | ||
|
|
||
| We propose the `TelemetryPolicy` as a direct policy attachment in the `agentic.networking.k8s.io` API group. See [GEP-713](https://gateway-api.sigs.k8s.io/geps/gep-713/#classes-of-policies) for more information on Direct attachment. |
Also, WHY a policy attachment? Why are you choosing this approach instead of inline Gateway configuration? Can you be more explicit about the rationale behind this decision?
|
No explicit requirement for:
Recommend first-class “authorization trace graph” for debugging + compliance. |
|
|
||
| The following is an example that demonstrates the structure of the `TelemetryPolicy`. | ||
|
|
||
| ```yaml |
Can we also have the status specification, please?
| overrides: | ||
| - name: "gateway.networking.k8s.io/http/request_count" | ||
| type: Counter | ||
| dimensions: # Custom labels/dimensions |
This part of the API is slightly confusing and looks somewhat like a Prometheus rewrite rule. Is this something supported by OTel, or is the assumption that the Gateway implementation will do the rewrites/overrides here?
|
|
||
| type TelemetryPolicySpec struct { | ||
| // Identifies the target resources (Gateway or Namespace) to which this policy attaches (GEP-713). | ||
| TargetRefs []NamespacedPolicyTargetReference `json:"targetRefs"` |
Can I attach the same policy to both a Gateway and a namespace? What happens? Is there a precedence? What would the status look like in the case of a namespace attachment?
| type TracingConfig struct { | ||
|
|
||
| // Global switch to enable or disable tracing. | ||
| Enabled bool `json:"enabled"` |
Do not use bools in APIs. It is bad if you later need something more than true/false (e.g., a new Enabled semantic that means "partialEnable").
Please consider a different way to state whether this config is enabled (e.g., whether tracingProvider is nil or not).
There was a problem hiding this comment.
(this comment applies to any usage of bool on this API)
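To make the suggestion concrete, here is a minimal sketch of the two usual alternatives to a bool: presence of an optional sub-struct, and a string enum that can grow new values later. All names below (`TracingProvider`, `Endpoint`, `TracingMode`, `TracingActive`) are hypothetical and chosen only for illustration; the proposal does not define them.

```go
package main

import "fmt"

// TracingProvider is a hypothetical sink definition.
type TracingProvider struct {
	Endpoint string `json:"endpoint"`
}

// TracingMode is a hypothetical string enum. Unlike a bool, new values
// (for example a partial mode) can be added without breaking the API.
type TracingMode string

const (
	TracingModeAll  TracingMode = "All"
	TracingModeNone TracingMode = "None"
)

// TracingConfig has no Enabled bool: tracing is considered active
// when a provider is set and the mode does not switch it off.
type TracingConfig struct {
	// Provider is optional; a nil pointer means tracing is not configured.
	Provider *TracingProvider `json:"provider,omitempty"`
	// Mode is optional; an empty value is treated as All.
	Mode TracingMode `json:"mode,omitempty"`
}

// TracingActive reports whether tracing should be turned on.
func (c TracingConfig) TracingActive() bool {
	return c.Provider != nil && c.Mode != TracingModeNone
}

func main() {
	unset := TracingConfig{}
	// The endpoint below is an assumed placeholder value.
	set := TracingConfig{Provider: &TracingProvider{Endpoint: "otel-collector:4317"}}
	fmt.Println(unset.TracingActive(), set.TracingActive())
}
```

The same pattern would apply to the `enabled` fields on access logs and metrics.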
| // | ||
| // +required | ||
| // +listType=atomic | ||
| // +kubebuilder:validation:MaxItems=16 |
so the targetRef is also limited to 16?
|
Folks, I think it's very important to be clear. This is your subproject, and you can do what you want. But if you want to be able to take this upstream to Gateway API, you must use the Gateway API GEP process, including the Provisional step. We will not be accepting a full API without full justification, using the full process. So if you design and implement an API without using the Gateway API process, you're risking that this will never end up upstream. And none of us wants that; this is a useful thing to have configured. However, this reads like an attempt to force the Gateway API maintainers to do what you want by bringing a completed API and implementations to us, and then saying "why are you being so difficult". Maybe it is, maybe it isn't, but that's what it looks like. We have very good reasons to be wary of adding Policy objects to Gateway API. They are very difficult to get right, and I have huge concerns with using the same Policy object for two very different targets. This Policy is also clearly useful to other use cases aside from Agentic Networking, and so must be designed for those use cases as well.
|
The conversation is getting heated without a reason. @youngnick what's the concrete concern? That the template is not the Gateway template verbatim? There are good comments from Ricardo about some missing sections, like why a policy, and a few others. And we are in agreement that we move this proposal to Gateway (this was also discussed in EGADS, iiuc). I am ooo so I was slow to respond on GitHub, but I chatted with Ricardo on Slack this morning and confirmed that as well. Regarding different layers, @youngnick, I agree with you. It does look like starting with Gateway is simpler and more useful, at least for the time being. But I'm happy to hear your thoughts as well. Agentic Net has concrete needs (and will continue to have them), and a lot of the work here will intersect with Gateway; some APIs will intersect more and some less. It's OK to have the "is Gateway the right home for this API" discussions; that's part of the process. Anyway, I promised to move this to Gateway by EOW.
|
I've given this a better read, and here are my thoughts, more specifically:
tl;dr This Policy, as written, is not suitable for upstreaming into Gateway API. It solves a problem that exists for most if not all Gateway API users, but builds the agentic parts into the core API. I'd much, much rather see a solution that allows for configuration of a generic telemetry config across HTTP requests (that is, requests where the Gateway implementation has access to an unencrypted HTTP stream), with an extension point (that can be pre-filled with agentic extensions, sure).
I'd also question, again, why it needs to be a Policy. The document assumes that the only way to do this is with Policy, which is true when the code lives in this repo, under this subproject, but is not the case upstream in Gateway API. If we want to configure all the HTTP traffic passing through a Gateway, why not add a top-level struct on the Gateway to do that? We've already done that recently for TLS, and telemetry is a core enough requirement that it makes sense. Maybe it's better to have the sink parameters and other definitely-global config at the Gateway level, and extend the fiddlier bits with something else. But that conversation is not really possible when we're not having it in the Gateway API process.
So, I was a little forceful before, but I am telling you all that what will definitely not happen is building something here, and then bringing that thing upstream to Gateway and expecting it to get merged unchanged.
I also agree that if you want to target Mesh use cases, then the Mesh resource is probably the right thing to target. But again, that is a case where a Policy is not required, because once the proposal is in upstream Gateway API, we can inline the fields inside the Mesh resource.
I would also strongly recommend that if you must have a Policy, and are going to allow this Policy to attach to multiple kinds of target, you carefully consider interaction and conflict management. What happens when there are two Policies targeting the same thing? What happens when a Gateway is in a namespace that is managed by a mesh? Do both configs take effect? If not, which one wins? @guicassolato mentions a merge strategy of I also looked over @rikatz's API review, and agree with his comments. In particular, |
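To illustrate what an explicit answer to "which one wins?" could look like, here is a small sketch of a deterministic tie-breaker. It follows the convention Gateway API uses elsewhere for conflicting resources (the oldest object wins, with ties broken alphabetically by namespace/name). Whether TelemetryPolicy would adopt this rule is an assumption, and the `PolicyRef` type, its fields, and the `Winner` function are hypothetical names used only for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// PolicyRef is a hypothetical, trimmed-down view of a policy object.
type PolicyRef struct {
	Namespace   string
	Name        string
	CreatedUnix int64 // creation timestamp in Unix seconds
}

// Winner picks the policy that takes effect among policies targeting the
// same resource: the oldest one wins, and ties are broken alphabetically
// by "namespace/name". ok is false when no policies are given.
func Winner(policies []PolicyRef) (winner PolicyRef, ok bool) {
	if len(policies) == 0 {
		return PolicyRef{}, false
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]PolicyRef(nil), policies...)
	sort.SliceStable(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if a.CreatedUnix != b.CreatedUnix {
			return a.CreatedUnix < b.CreatedUnix
		}
		return a.Namespace+"/"+a.Name < b.Namespace+"/"+b.Name
	})
	return sorted[0], true
}

func main() {
	// Assumed example objects; names and timestamps are made up.
	policies := []PolicyRef{
		{Namespace: "apps", Name: "team-b", CreatedUnix: 200},
		{Namespace: "apps", Name: "team-a", CreatedUnix: 200},
		{Namespace: "infra", Name: "default", CreatedUnix: 100},
	}
	if w, ok := Winner(policies); ok {
		fmt.Printf("effective policy: %s/%s\n", w.Namespace, w.Name)
	}
}
```

A rule like this only settles conflicts between policies on the same target. The Gateway-inside-a-meshed-namespace case raised above would still need its own precedence decision.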
|
I have also reread and broadly agree with @youngnick and @rikatz. I'm not going to say much about specific details, because to me those are wildly overwhelmed by the fact that GEP-713 policy is very much not my first choice for how to approach this -- and in my mind, that immediately brings us back around to questions about upstream Gateway API and its processes.
Many of y'all participate in Gateway API pretty actively, but for the benefit of those who don't: if you want to add something to Gateway API - and my sense is that many or most of y'all would like this to be part of Gateway API itself - you start with a Provisional GEP. Provisional GEPs aren't about the details of the API, they're about the big questions: what problems are we trying to solve? why are they important to solve? for whom are we designing the solutions? These are critical, and they're rooted in explorations of user stories rather than API design -- but questions like "should we use GEP-713 policy for this?" are wrapped up in those questions, too.
So the first thing I'm saying here is that, given the feeling that there is consensus around moving into Gateway API, a Provisional GEP feels like a better next step than anything else. Without the Provisional GEP, there isn't a way into Gateway API, and if someone were to present an API design, we would need to back up to the Provisional step where everything is back on the table anyway, because though many of y'all may already have the answers to what, why, and who in your head, the Gateway API folks don't, and will need it explained.
Additionally, as for why GEP-713 isn't my first choice... GEP-713 is really a way of coping with the fact that we can't extend core Kubernetes resources like Service. When we're dealing with resources we do control - like Gateways or Routes - GEP-713 should never be the first thing to reach for.
Policies are deceptively easy to create, but they introduce really nasty questions (as @youngnick already mentioned) and every last one we accept as a part of Gateway API itself comes with the burden of supporting it - with all its issues - forever. 🙁 That's why you're seeing pushback there from me, and from others. |
|
For avoidance of doubt, a GEP is/will be opened in gateway-api for this. /remove-lgtm Keeping open for any remaining agentic-specific discussion to conclude before closing.
|
Wanted to chime in since I don't see NGINX considered yet. We currently support OTEL tracing through an ObservabilityPolicy. This enables tracing for a Route. We also have a global nginx CRD where a user would set their exporter URL for the Gateway. This design was originally so a user didn't have to set this value on every single policy, and instead just set it globally. This CRD also allows for setting global span attributes for the Gateway. Here is our tracing document on how we set everything up. We also have considerations for future enhancements to our own module. |
What type of PR is this?
/kind documentation
What this PR does / why we need it:
This PR contains a proposal for a new `TelemetryPolicy` API. This K8s API aims to standardize how users enable and configure telemetry (metrics, logs, traces) across different data plane implementations, replacing vendor-specific CRDs with a unified, portable spec.
Which issue(s) this PR fixes:
Fixes #
Does this PR introduce a user-facing change?: