Add Message Attestation for ServerToAgent messages #333
Introduce an optional, opt-in, end-to-end integrity mechanism for every ServerToAgent message based on X.509 certificate chains. Trust is rooted in an operator-configured payload trust anchor that is distinct from the TLS CA pool, decoupling the OpAMP distribution server from the authoritative source of OpAMP messages.

Wire-level changes (all Status: [Development]):
* AgentCapabilities.RequiresPayloadTrustVerification = 0x00010000
* ServerCapabilities.OffersPayloadTrustVerification = 0x00000080
* ServerToAgent.trust_chain_response = 12 (new TrustChainResponse message)
* ServerToAgent.signature = 13

The new "Message Attestation" section in specification.md covers the threat model, trust model, opt-in semantics, capability negotiation, connection-time handshake, in-session per-message verification, algorithm selection (any X.509-supported), certificate requirements, failure modes (all result in agent disconnect), and out-of-scope items.

Strict opt-in: implementations that do not implement Message Attestation require no changes. Existing OpAMP deployments are unaffected until both sides opt in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Stanley Liu <stanley.liu@datadoghq.com>
| pre-configured trust anchor (a CA certificate) that is **distinct** from
| the TLS certificate authority used to establish the transport.
Suggested change:
- pre-configured trust anchor (a CA certificate) that is **distinct** from
- the TLS certificate authority used to establish the transport.
+ pre-configured trust anchor (a CA certificate) that MUST differ from
+ the TLS certificate authority used to establish the transport.
You use the MUST wording later on.
I'm interested in what drives this requirement for separate CAs. It would require a customer that deploys message attestation to maintain and use two distinct CAs, which is a bit of a burden.
If it is to split the security domains between TLS and attestation, then I think we get that already using the Key Usage constraints on the certificates themselves:
- A server using TLS requires the `serverAuth` key usage bit to be set.
- This spec requires that the `codeSigning` bit be set.
So this lets us have two distinct chain "roles" that can chain to the same root, but are isolated from each other. You could go as far as to have the codeSigning intermediate CA stored separately from the serverAuth intermediate, physically isolating them too.
| update or replace the Agent's payload trust anchor. This is a deliberate
| constraint to prevent a compromised Server from rotating the Agent onto
| an attacker-controlled trust anchor.
What's the guidance if the trust root is set to expire?
Is some external process needed in order to update?
If a root expires, the assumption is that the new trust root certificates will be redistributed to the clients manually or out of band. Baking root certificate rotation into the protocol would allow attackers to hijack the entire attestation process.
| * If `trust_chain_response` is unset, the Agent MUST terminate the
|   connection.
| * If `trust_chain_response.error_message` is non-empty, the Agent
|   MUST terminate the connection.
When an agent terminates, should it retry?
Do we want to use a specific error_message attribute instead of the ServerToAgent.error_response?
IMHO retries (with an exponential backoff to avoid hurting the OpAMP servers) are OK; they would allow clients to self-recover from some classes of attack, for example DNS-based ones, without having to do anything specific on the client side (in this example, once the DNS issue is resolved).
I personally don't see a need to have a dedicated message attribute, but I don't have a strong opinion on my side.
tigrannajaryan left a comment
I am blocking this for now to prevent accidental merging.
This needs a thorough review.
I am also not sure we want this now; it is not aligned with anything we have on the roadmap: #321
If we make the call we want this in the spec, this is going to require a prototype demonstrating it (not needed for now) and we will need to decide when we want this to be added (possibly after 1.0).
domodwyer left a comment
Disclaimer: I provided technical help for the proposal.
I agree - I think getting a working prototype nailed down would really help shake out any issues 👍
I understand the roadmap to V1 is already being worked on so the timing isn't great, but I do think there's value in at least investigating what a more secure protocol might look like.
The primary concern with OpAMP as it is proposed for V1 is that it's a security single point of failure. If you compromise the distribution server (which necessarily sits in a risky network position, reachable by many machines) then it's game over: the attacker has full control of the fleet.
This is especially bad if you're a 3rd party provider like Datadog (my employer) as a single nginx exploit or similar could yield control of many thousands of machines across many distinct organisations. This kind of exposure becomes a significant business risk and isn't something that's easy to accept - it unfortunately fundamentally restricts our adoption of the open protocol we're otherwise big fans of!
We're definitely motivated to help develop any solutions 🙏
| state, but not the signing keys, can still alter OpAMP messages in
| flight.
|
| Message Attestation does **not** address:
I think there are two other properties that this proposal doesn't currently address and that we might want to consider. Both belong to a class of attacks that we expect our internal protocol to mitigate:
- Message replay: if an attacker obtains a valid and signed message, they can reuse that message and send it to the same agent again, or a different agent that was not the intended recipient.
- Rollback attack: related to the above, a previously captured message can be sent to revert any subsequent changes (like "roll back a config change").
A classic replay attack in this context would be capturing a validly signed ServerToAgentCommand.type = Restart message for an Agent, and then resending it every time that agent connects, causing a DoS of the Agent using the attested message.
A more subtle attack would be to use the above replay trick to obtain a legitimate remote config message that compels an Agent to upload sensitive files to the server or perform some similarly risky action, and then broadcast that command to all connected agents instead of just the intended recipient.
Notably an attacker can't change the content of these messages (that would invalidate the signature). They're constrained to only replaying messages previously sent by the server.
Fixes are relatively easy:
- To prevent cross-session replay attacks, you can make use of a session nonce ("number used once") that is randomly generated by the Agent, is unique to the connection, and is part of the signed `ServerToAgent` payloads sent to that Agent. Upon receiving a payload, the Agent checks that the nonce matches. That prevents a message destined for Agent A from being replayed to Agent B.
- To prevent intra-session replay attacks, maintain a sequence number such that an old message cannot be resent later in the session to roll back the effects of a message (e.g. cannot "undo" a "change config" request), much like your car probably does. There's a similar mechanism already in place for `AgentToServer` messages, but for a different purpose.
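The two checks above can be sketched as a small per-connection guard on the Agent side. This is an assumption-laden illustration: the `ReplayGuard` type, the 16-byte nonce size, and the rule that sequence numbers start at 1 and must strictly increase are all hypothetical, and the nonce and sequence number are assumed to have been read from a payload whose signature was already verified.

```go
package main

import (
	"bytes"
	"crypto/rand"
	"errors"
	"fmt"
)

var (
	ErrWrongSession  = errors.New("nonce does not match this session")
	ErrStaleSequence = errors.New("sequence number is not newer than the last accepted one")
)

// ReplayGuard is a hypothetical per-connection replay check kept by the Agent.
type ReplayGuard struct {
	nonce   []byte // random value the Agent generated for this connection
	lastSeq uint64 // highest sequence number accepted so far (0 = none yet)
}

// NewReplayGuard creates a guard with a fresh random session nonce.
func NewReplayGuard() (*ReplayGuard, error) {
	nonce := make([]byte, 16)
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return &ReplayGuard{nonce: nonce}, nil
}

// Nonce returns the value the Agent would send to the Server at connection time.
func (g *ReplayGuard) Nonce() []byte { return g.nonce }

// Check rejects messages signed for another session (cross-session replay)
// and messages that are not strictly newer than the last one (rollback).
func (g *ReplayGuard) Check(nonce []byte, seq uint64) error {
	if !bytes.Equal(nonce, g.nonce) {
		return ErrWrongSession
	}
	if seq <= g.lastSeq {
		return ErrStaleSequence
	}
	g.lastSeq = seq
	return nil
}

func main() {
	agentA, _ := NewReplayGuard()
	agentB, _ := NewReplayGuard()

	fmt.Println(agentA.Check(agentA.Nonce(), 1)) // fresh message for this session
	fmt.Println(agentA.Check(agentA.Nonce(), 1)) // same message replayed
	fmt.Println(agentB.Check(agentA.Nonce(), 1)) // message for A replayed to B
}
```

Keeping the nonce and sequence number inside the signed bytes is what makes the check meaningful; if they travelled outside the signature, an attacker could rewrite them.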
Note: there is the instance_id field which goes a long way to preventing cross-Agent replay attacks, but the protocol allows the server to push an "instance ID change" request (ref) which could be used to switch the Agent to an instance ID that is part of a previously captured request. I'm not familiar enough with the implementation to know if this is a viable path, and if not, the existing instance ID may suffice as a session nonce, but it should be documented as a load-bearing part of the protocol security if used as such.
| The Server is independently configured with a signing key and its
| corresponding certificate chain that validates back to the payload trust
| anchor.
Is this paragraph implying that the server MUST validate payloads before they are sent to the client?
If so, I would consider skipping this - it's not going to help you in the case of an attack (the attacker can simply send their message directly, bypassing the check) but it imposes a cost in the happy path as every message must be validated (latency + CPU).
|
| ### Certificate Requirements
|
| The signing leaf certificate MUST:
Another constraint: the certificate must have a dNSName or iPAddress SAN entry that matches the hostname (or IP address, if IPs are used) of the OpAMP distribution server.
This will be required by any X.509 verification library; it is what is used to ensure you're using the right certificate for the server you're connected to.
| * **Per-message-type opt-out (signing allowlist).** Mechanisms by which | ||
| an Agent might accept some `ServerToAgent` message types unsigned — | ||
| for example, to allow a third-party fleet manager to push low-risk | ||
| read-only telemetry settings while still requiring authoritative | ||
| signatures for configuration or command messages — are deferred to a | ||
| follow-up specification. |
I think this presents an interesting future development if justified by usage - being able to safely delegate parts of operating the fleet to 3rd parties or higher-risk infra, while retaining control over risk factors is a nice feature.
| 1. Construct the `ServerToAgent` message with all required fields set
|    except `signature`, which is left empty (zero-length).
| 2. Serialise the message using deterministic Protocol Buffers encoding
|    (for example, Go's `proto.MarshalOptions{Deterministic: true}` or
|    the equivalent in other implementations).
| 3. Sign the resulting byte string with the Server's signing private
|    key, using the signature algorithm declared by the leaf
|    certificate's `signatureAlgorithm` field.
| 4. Set `signature` to the resulting signature bytes.
I think this is probably going to be problematic - the protobuf encoding does not have a canonical representation. The docs specifically state that, even in "deterministic output" mode:
The serializer can generate different output for many reasons, including but not limited to the following variations:
- The protobuf schema changes in any way.
- The application being built changes in any way.
- The binary is built with different flags (eg. opt vs. debug).
- The protobuf library is updated.
Which isn't going to provide sufficient guarantees when deployed across different versions / implementations / platforms / etc.
In general signing a payload that also contains the signature is definitely tricky and kind of forces this "serialise -> sign -> reserialise" dance that then requires the serialisation be deterministic on all platforms.
In an ideal world, the signature would be part of the metadata sent alongside the signed payload, rather than within the payload itself. For the HTTP transport that is easy: the signature can be a header value. But the WebSocket transport doesn't provide a metadata side channel like HTTP headers, and the protocol doesn't currently have any provision for metadata alongside the message payload.
Here are some possible solutions. They all shift the signature out of the payload (to make a "detached signature"), but there might be other ways to do this that I've not thought about.
Negotiated Type Switch
After the feature flag exchange has resulted in negotiating the use of signed messages, switch to using a different type instead of ServerToAgent if and only if message attestation is enabled:
message SignedServerToAgent {
    bytes payload = 1;    // Serialised ServerToAgent
    bytes signature = 2;
}
// ServerToAgent unchanged

Clients will need to have a new check early in the message processing path, but then processing continues as usual:
payload := conn.Read()
var message ServerToAgent
if usingSigning {
    message = verify_and_parse(payload)
} else {
    message = parse_server_to_agent(payload)
}
// Continues like normal

Where verify_and_parse() parses the SignedServerToAgent protobuf message, validates the signature, and then returns the inner ServerToAgent. The rest of the message processing continues like normal, using the ServerToAgent as it does today.
Pros:
- Protocol is fully backwards compatible with existing clients.
Cons:
- Protocol is now not fully described by a single type for the server -> agent direction.
Explicit Protocol Metadata
Obviously the protocol is not yet v1, so technically a breaking change could be made to support attestation directly - either by modifying the protobuf:
// Redefine the top-level type:
message ServerToAgent {
    ServerToAgentPayload payload = 1;
    // or bytes payload = 1;

    // metadata here
    bytes signature = 2;
}

message ServerToAgentPayload {
    // Existing fields here
}

This adds user friction though and probably wouldn't be well received, but it is a "one time" cost prior to stability.
Alternatively, by changing the WebSocket header to explicitly indicate that a signature is present alongside the encoded protobuf payload:
┌────────────┬────────────────────────────────────────┬───────────────────┐
│header │varint encoded unsigned 64 bit integer │1-10 bytes │
├────────────┼────────────────────────────────────────┼───────────────────┤
│data │Encoded Protobuf message, │0 or more bytes │
│ │either AgentToServer or ServerToAgent │ │
├────────────┼────────────────────────────────────────┼───────────────────┤
│metadata │serialised metadata │0 or more bytes │
└────────────┴────────────────────────────────────────┴───────────────────┘
They're both equivalent except the former couples the metadata to protobuf.
Pro:
- Explicit "bag of metadata" separates concerns from "payload content" in the same way HTTP / gRPC does and can be reused.
Cons:
- Breaking change - probably a difficult option at this point.
Response to PR open-telemetry#333 review feedback (discussion_r3235566207).

The previous design signed an inline ServerToAgent.signature field by clearing the field, deterministically re-marshalling, and signing the result. That requires byte-identical "deterministic" output across every protobuf implementation and version, which proto explicitly does not guarantee (https://protobuf.dev/programming-guides/serialization-not-canonical/).

This commit moves the signature and trust-chain delivery onto a new top-level envelope message:
* SignedServerToAgent { bytes payload = 1; bytes signature = 2; TrustChainResponse trust_chain_response = 3; }
* `payload` carries the marshalled bytes of an inner ServerToAgent; `signature` is a detached signature over those exact wire bytes (the Agent verifies without re-marshalling).

The envelope is sent only when both peers have negotiated Message Attestation (RequiresPayloadTrustVerification + OffersPayloadTrustVerification). For all other connections the wire format is byte-identical to upstream OpAMP: ServerToAgent is untouched, and implementations that don't opt in see no wire change.

Removes the trust_chain_response (field 12) and signature (field 13) inline fields added to ServerToAgent in the previous commit; both field numbers and names are now reserved on ServerToAgent.

Spec narrative updated: rewrites Connection-Time Handshake and In-Session Signature Verification subsections; adds a new SignedServerToAgent Message subsection; adds a rationale paragraph citing the protobuf non-canonicality guidance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves #265, see proposal document for context.
Summary
This PR introduces an optional, end-to-end integrity mechanism for `ServerToAgent` messages based on X.509 certificate chains. When both the Server and the Agent opt in, every `ServerToAgent` message sent after the initial handshake carries a signature that the Agent verifies against a pre-configured trust anchor (a CA certificate) that is distinct from the TLS certificate authority used to establish the transport.
The mechanism allows OpAMP deployments to separate the distribution server from the authoritative source of OpAMP messages. A compromised distribution server cannot, in this model, push arbitrary configuration or commands to an Agent, because every message must be signed by a key whose trust chain validates against the Agent's pre-configured root.
Strict opt-in: implementations that do not implement Message Attestation require no changes. Existing OpAMP deployments are unaffected until both sides opt in.
Changes
Wire-level changes (all Status: [Development]):
The new "Message Attestation" section in `specification.md` covers the threat model, trust model, opt-in semantics, capability negotiation, connection-time handshake, in-session per-message verification, algorithm selection (any X.509-supported), certificate requirements, failure modes (all result in agent disconnect), and out-of-scope items.
Client-facing changes
To enable verification of messages, the OpAMP client is configured with a root certificate to trust at startup via a config value:
When provided, the client will indicate to the server that it requires authenticated payloads, and use the provided root certificate to verify messages the server sends.
Motivation
The OpAMP Server is currently treated as both a point of distribution and the authoritative source of OpAMP messages. If an OpAMP Server is compromised, clients can be subject to significant security vulnerabilities such as forced restarts, breaches of sensitive information, and unintended binary updates. By requiring messages to be signed by an authoritative source, an attacker gaining access to the OpAMP Server would not be able to sign messages that do not meet strict validation/type requirements. OpAMP Agents will also have the capability to reject messages with invalid certificates, and can immediately terminate the connection, preventing further attacks.