
feat: Replace static musl build on alpine with glibc based build on gcr distroless#2919

Merged
lquerel merged 18 commits into
open-telemetry:mainfrom
JakeDern:glibc-image
May 18, 2026

@JakeDern
Contributor

Change Summary

This PR changes our standard df_engine image from a static musl build on Alpine to a glibc build on GCR distroless.

Part of this change requires updating the mount paths for orchestrator config files, because the home directory in the image is now `/home/nonroot` instead of `/home/dataflow`.
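
As a rough illustration of the path change, here is a minimal sketch that rewrites the container side of a mount. It assumes mounts are written in Docker's `host:container[:mode]` short form; the orchestrator's actual config format may differ, and `migrate_mount` is a hypothetical helper, not something in this repo.

```python
from pathlib import PurePosixPath

OLD_HOME = PurePosixPath("/home/dataflow")  # home directory in the old alpine image
NEW_HOME = PurePosixPath("/home/nonroot")   # home directory in the distroless image


def migrate_mount(spec: str) -> str:
    """Rewrite the container path of a 'host:container[:mode]' volume spec.

    Hypothetical helper: specs whose container path is not under the old
    home directory are returned unchanged.
    """
    parts = spec.split(":")
    if len(parts) < 2:
        return spec  # not a host:container spec, leave it alone
    target = PurePosixPath(parts[1])
    try:
        relative = target.relative_to(OLD_HOME)
    except ValueError:
        return spec  # container path is outside the old home directory
    parts[1] = str(NEW_HOME / relative)
    return ":".join(parts)
```

For example, `migrate_mount("./config.yaml:/home/dataflow/config.yaml:ro")` yields a spec that targets `/home/nonroot/config.yaml`. The sketch does not handle host paths that themselves contain a colon.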

What issue does this PR close?

How are these changes tested?

To test this, I did local runs of all the nightly/continuous/comparison dashboard suites using the new image + mount path changes.

I also did my best to test cross-compiling with the new ARM targets:

[screenshot]

Are there any user-facing changes?

Yes. The runtime image in the repo Dockerfile has changed, and config files must now be mounted under the new home directory.

@JakeDern JakeDern requested a review from a team as a code owner May 10, 2026 20:59
@github-actions github-actions Bot added the rust and ci-repo labels May 10, 2026
Contributor

@lquerel lquerel left a comment

LGTM

@lquerel lquerel enabled auto-merge May 11, 2026 01:52
Comment thread .github/workflows/dataflow-engine-binary-size.yml Outdated
Member

@lalitb lalitb left a comment

Nice work on this. I authored #2214, and I think this is the better fix overall. #2214 fixed the immediate musl build issue. This PR avoids that problem completely by moving the image to glibc + distroless, which feels cleaner long term.

I left one small comment about the binary size workflow, but overall this direction looks good to me.

Comment thread rust/otap-dataflow/Dockerfile Outdated
@JakeDern
Contributor Author

I ran a little experiment and indeed after this PR we use a huge amount of memory to build compared to before - Now the question is why:

  ┌─────────────────────────┬───────────┬──────────────────┐
  │         Branch          │ Wall time │ Peak memory used │
  ├─────────────────────────┼───────────┼──────────────────┤
  │ main (9a664bf71)        │ 6m 30s    │ 4.85 GiB         │
  ├─────────────────────────┼───────────┼──────────────────┤
  │ glibc-image (1165f5dbf) │ 8m 22s    │ 18.51 GiB        │
  └─────────────────────────┴───────────┴──────────────────┘

@codecov

codecov Bot commented May 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.93%. Comparing base (9a664bf) to head (17dcbaf).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2919      +/-   ##
==========================================
- Coverage   85.93%   85.93%   -0.01%     
==========================================
  Files         725      725              
  Lines      275782   275782              
==========================================
- Hits       236993   236988       -5     
- Misses      38265    38270       +5     
  Partials      524      524              
| Components | Coverage Δ |
|---|---|
| otap-dataflow | 87.06% <ø> (-0.01%) ⬇️ |
| query_abstraction | 80.61% <ø> (ø) |
| query_engine | 89.57% <ø> (ø) |
| otel-arrow-go | 52.45% <ø> (ø) |
| quiver | 92.25% <ø> (ø) |

@lquerel lquerel added this pull request to the merge queue May 18, 2026
Merged via the queue into open-telemetry:main with commit f562f90 May 18, 2026
86 checks passed
@lalitb
Member

lalitb commented May 18, 2026

> I ran a little experiment and indeed after this PR we use a huge amount of memory to build compared to before - Now the question is why:
>
>   ┌─────────────────────────┬───────────┬──────────────────┐
>   │         Branch          │ Wall time │ Peak memory used │
>   ├─────────────────────────┼───────────┼──────────────────┤
>   │ main (9a664bf71)        │ 6m 30s    │ 4.85 GiB         │
>   ├─────────────────────────┼───────────┼──────────────────┤
>   │ glibc-image (1165f5dbf) │ 8m 22s    │ 18.51 GiB        │
>   └─────────────────────────┴───────────┴──────────────────┘

My wild guess is that the final link step is more memory-hungry for glibc than for musl. We can probably validate this by adding temporary memory sampling around the `cargo build --release` step and checking whether the peak happens near the final rustc/linker invocation.
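
A minimal sketch of that sampling idea, assuming a Linux build host where `/proc/meminfo` is readable. The helpers `read_used_kib`, `sample_while_running`, and `peak_position` are hypothetical names, not part of the repo. The reading is system-wide, so other processes on the host add noise.

```python
import subprocess
import time


def read_used_kib(path="/proc/meminfo"):
    """Return system-wide used memory (MemTotal - MemAvailable) in KiB. Linux only."""
    fields = {}
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            if rest.strip():
                fields[key] = int(rest.split()[0])
    return fields["MemTotal"] - fields["MemAvailable"]


def sample_while_running(cmd, interval=0.5, read=read_used_kib):
    """Run cmd and sample memory every `interval` seconds until it exits.

    Returns (exit_code, samples), where each sample is
    (seconds_since_start, used_kib).
    """
    samples = []
    start = time.monotonic()
    proc = subprocess.Popen(cmd)
    while True:
        samples.append((time.monotonic() - start, read()))
        try:
            proc.wait(timeout=interval)
            break  # process finished
        except subprocess.TimeoutExpired:
            continue  # still running, take another sample
    return proc.returncode, samples


def peak_position(samples):
    """Return (peak_kib, fraction_of_run) for the highest sample."""
    peak_time, peak_kib = max(samples, key=lambda s: s[1])
    total = samples[-1][0]
    fraction = peak_time / total if total > 0 else 0.0
    return peak_kib, fraction
```

One way to use it would be `code, samples = sample_while_running(["cargo", "build", "--release"])` followed by `peak_position(samples)`. A fraction close to 1.0 would be consistent with the peak landing in the final link step, though it would not prove the linker is the cause.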

@JakeDern
Contributor Author

JakeDern commented May 19, 2026

> > I ran a little experiment and indeed after this PR we use a huge amount of memory to build compared to before - Now the question is why:
> >
> >   ┌─────────────────────────┬───────────┬──────────────────┐
> >   │         Branch          │ Wall time │ Peak memory used │
> >   ├─────────────────────────┼───────────┼──────────────────┤
> >   │ main (9a664bf71)        │ 6m 30s    │ 4.85 GiB         │
> >   ├─────────────────────────┼───────────┼──────────────────┤
> >   │ glibc-image (1165f5dbf) │ 8m 22s    │ 18.51 GiB        │
> >   └─────────────────────────┴───────────┴──────────────────┘
>
> My wild guess is that the final link step is more memory hungry for glibc as compared to musl. Probably can validate by adding temporary memory sampling around the cargo build --release step and checking whether the peak happens near the final rustc/linker invocation.

Removing the config.toml overrides for the linker brought it back down. I originally kept them and only changed them to -gnu because I wasn't sure why they were there, then removed them as an experiment, which worked. My guess is that specifying those overrides clobbers more settings/flags than it appears to, and that this somehow affects memory usage at link time.

pull Bot pushed a commit to thompson-tomo/otel-arrow that referenced this pull request May 19, 2026
…-otelcol nightly suite (open-telemetry#3032)

The nightly Pipeline Performance Tests run
[26069747923](https://github.com/open-telemetry/otel-arrow/actions/runs/26069747923/job/76648381259)
failed in the "Run syslog TCP performance test otelcol log suite" step
because the backend (`df_engine`) container never became ready
(readiness check timed out after 10 attempts).

Root cause: PR open-telemetry#2919 switched the `df_engine` image from musl/alpine to
glibc/distroless, changing the in-container home directory from
`/home/dataflow` to `/home/nonroot`, and updated the volume mount path
in all then-existing nightly docker yamls. The
`syslog-tcp-otelcol-docker.yaml` suite (added concurrently in open-telemetry#2962) was
missed, so it was still mounting the backend config at
`/home/dataflow/config.yaml` — a path that doesn't exist in the new
image — causing the backend container to fail to start.

This one-line change brings it in line with the other nightly suites.
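
A minimal sketch of a check that could catch this kind of miss, assuming the suite definitions are YAML files somewhere under one directory tree. `find_stale_mounts` is a hypothetical helper, not an existing script in the repo, and it does a plain substring search rather than parsing YAML.

```python
from pathlib import Path

STALE_HOME = "/home/dataflow"  # home directory of the old alpine-based image


def find_stale_mounts(root, needle=STALE_HOME, patterns=("*.yaml", "*.yml")):
    """Return (path, line_number, line) for every line that mentions `needle`.

    Hypothetical helper: walks `root` recursively and scans matching files
    line by line, so it also flags comments that mention the old path.
    """
    hits = []
    for pattern in patterns:
        for path in sorted(Path(root).rglob(pattern)):
            lines = path.read_text(encoding="utf-8").splitlines()
            for lineno, line in enumerate(lines, start=1):
                if needle in line:
                    hits.append((path, lineno, line.strip()))
    return hits
```

Running it over the directory that holds the nightly suite files and failing CI when the result is non-empty would flag any file still mounting under `/home/dataflow`.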

Labels

ci-repo: Repository maintenance, build, GH workflows, repo cleanup, or other chores
rust: Pull requests that update Rust code

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Replace static musl build + alpine image with glibc build + distroless gcp image

6 participants