
feat: add v1-registry-sync dry-run tool for syncing V2 data into V1 entries #467

Merged
vitorvasc merged 9 commits into open-telemetry:main from Rama542:feat/465-v1-registry-sync
May 18, 2026

@Rama542
Contributor

@Rama542 Rama542 commented May 13, 2026

Summary

This PR adds a new ecosystem-automation/v1-registry-sync Python package as a
first step toward issue #465.

The tool reads the latest V2 registry snapshot from ecosystem-registry/collector/
and generates a report showing exactly which stability, display_name, and
description values would be written into matching V1 entries under
opentelemetry.io/data/registry/. For stability, it selects the most stable signal
level across all signals per component (stable > beta > alpha > development >
deprecated > unmaintained).
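
The selection rule above can be sketched as follows. This is a minimal illustration rather than the code in models.py or reader.py: the function name `most_stable` and the sample signal mapping are assumptions, and only the priority order comes from the description.

```python
from typing import Iterable, Optional

# Priority order quoted in the PR description, most stable first.
STABILITY_PRIORITY = [
    "stable", "beta", "alpha", "development", "deprecated", "unmaintained",
]


def most_stable(levels: Iterable[str]) -> Optional[str]:
    """Return the highest-priority level present, or None if none is recognized."""
    known = [level for level in levels if level in STABILITY_PRIORITY]
    if not known:
        return None
    # A lower index in the priority list means a more stable level.
    return min(known, key=STABILITY_PRIORITY.index)


# Hypothetical per-signal stability for one component.
signal_levels = {"traces": "beta", "metrics": "alpha", "logs": "development"}
print(most_stable(signal_levels.values()))  # beta
```

Unrecognized level names are ignored in this sketch; the real tool may treat them differently.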

This version runs in dry-run mode only. The actual write step to opentelemetry.io
would require coordination with the opentelemetry.io maintainers on the V1 schema
and PR workflow, and is left for a follow-up.

What was added

  • models.py: ComponentSyncData and V1SyncReport dataclasses
  • reader.py: reads V2 YAML files and extracts fields for V1 sync
  • reporter.py: serializes the report to JSON or YAML
  • main.py: CLI entry point (uv run v1-registry-sync)
  • 18 unit tests covering the reader and reporter modules

How to run

From the repo root:
uv run v1-registry-sync --distribution contrib --format json

Closes #465

…ntries

Adds a new ecosystem-automation/v1-registry-sync Python package that reads
the latest V2 registry snapshot and generates a report showing which
stability, display_name, and description values would be written into
matching V1 entries under opentelemetry.io/data/registry/.

The tool runs in dry-run mode only for now and outputs a JSON or YAML
report. It selects the most stable signal level across all signals for
each component (stable > beta > alpha > development > deprecated >
unmaintained) and omits null fields from the output.
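
Omitting null fields could look roughly like the sketch below. The dataclass fields are assumed from the values named in this message, the entry shape follows the `name` / `proposed_v1_changes` layout quoted later in the review, and the helper names and sample description are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ComponentSyncData:
    # Field names are illustrative; the real dataclass in models.py may differ.
    name: str
    stability: Optional[str] = None
    display_name: Optional[str] = None
    description: Optional[str] = None


def proposed_changes(component: ComponentSyncData) -> dict:
    """Keep only the fields that actually carry a value."""
    fields = asdict(component)
    fields.pop("name")
    return {key: value for key, value in fields.items() if value is not None}


def to_report_entry(component: ComponentSyncData) -> dict:
    return {
        "name": component.name,
        "proposed_v1_changes": proposed_changes(component),
    }


# Hypothetical component with only a description set.
entry = to_report_entry(
    ComponentSyncData(name="kafkareceiver", description="Receives from Kafka.")
)
print(json.dumps(entry, indent=2))
```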

18 unit tests cover the reader and reporter modules.

Closes open-telemetry#465
@Rama542 Rama542 requested review from a team as code owners May 13, 2026 11:20
@netlify

netlify Bot commented May 13, 2026

Deploy Preview for otel-ecosystem-explorer ready!

Name Link
🔨 Latest commit 72eaa99
🔍 Latest deploy log https://app.netlify.com/projects/otel-ecosystem-explorer/deploys/6a0ae47c61a23e00088283e4
😎 Deploy Preview https://deploy-preview-467--otel-ecosystem-explorer.netlify.app

Contributor

Copilot AI left a comment


Pull request overview

Adds a new v1-registry-sync Python workspace package that reads the latest V2 collector registry snapshot and emits a dry-run JSON/YAML report of the stability, display_name, and description values that would be synced into the matching V1 entries on opentelemetry.io. No writes to V1 are performed in this PR.

Changes:

  • New v1-registry-sync workspace package with models, reader, reporter, and main modules; per-component "most stable" stability is selected via a fixed priority list (stable > beta > alpha > development > deprecated > unmaintained).
  • CLI entry point v1-registry-sync with --inventory-dir, --distribution, --output, and --format flags.
  • 18 unit tests for reader and reporter using a synthetic V2 registry layout.
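
A CLI with the four flags listed above might be wired up along these lines. Only the flag names come from the overview; the defaults, the `choices` restriction, and the `build_parser` helper are assumptions made for illustration and may not match main.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="v1-registry-sync")
    # All defaults below are assumptions, not taken from the PR.
    parser.add_argument("--inventory-dir", default="ecosystem-registry/collector")
    parser.add_argument("--distribution", default="contrib")
    parser.add_argument("--output", default="-", help="report path, '-' for stdout")
    parser.add_argument("--format", choices=["json", "yaml"], default="json")
    return parser


# Parse an explicit argument list so the sketch runs without a shell.
args = build_parser().parse_args(["--distribution", "contrib", "--format", "yaml"])
print(args.inventory_dir, args.distribution, args.output, args.format)
```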

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| pyproject.toml | Registers the new v1-registry-sync package in the uv workspace. |
| ecosystem-automation/v1-registry-sync/pyproject.toml | Project metadata, deps (PyYAML, semantic-version), and CLI script entry. |
| ecosystem-automation/v1-registry-sync/src/v1_registry_sync/__init__.py | Apache header for the package module. |
| ecosystem-automation/v1-registry-sync/src/v1_registry_sync/models.py | ComponentSyncData / V1SyncReport dataclasses and stability priority list. |
| ecosystem-automation/v1-registry-sync/src/v1_registry_sync/reader.py | Locates latest v{version} dir per distribution and parses component YAMLs. |
| ecosystem-automation/v1-registry-sync/src/v1_registry_sync/reporter.py | Writes the report to a stream as JSON or YAML. |
| ecosystem-automation/v1-registry-sync/src/v1_registry_sync/main.py | argparse CLI entry point with logging. |
| ecosystem-automation/v1-registry-sync/tests/__init__.py | Apache header for tests package. |
| ecosystem-automation/v1-registry-sync/tests/test_reader.py | Tests for stability priority, latest-version selection, and parsing. |
| ecosystem-automation/v1-registry-sync/tests/test_reporter.py | Tests for JSON/YAML report serialization and proposed-changes filtering. |

@lucacavenaghi97
Member

Hi @Rama542, a few things worth checking before this lands.

1. stability is not in the V1 schema

opentelemetry.io/data/registry-schema.json has no stability property and declares additionalProperties: false at the top level. The validator in gulp-src/validate-registry.js would reject any V1 entry containing stability, so the "stability": "<level>" value the report currently emits in proposed_v1_changes would not be applicable today as-is.
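
To illustrate why such an entry would fail, here is a small stand-alone check that mimics only the top-level `additionalProperties: false` rule. It is not the validator in gulp-src/validate-registry.js, the schema shown is a trimmed, hypothetical stand-in for registry-schema.json, and `unexpected_properties` is a made-up helper.

```python
def unexpected_properties(entry: dict, schema: dict) -> list[str]:
    """List top-level keys that a schema with additionalProperties: false rejects."""
    if schema.get("additionalProperties", True) is not False:
        return []  # extra keys are allowed, so nothing to report
    allowed = set(schema.get("properties", {}))
    return sorted(key for key in entry if key not in allowed)


# Hypothetical, heavily trimmed stand-in for the real V1 schema.
schema = {
    "type": "object",
    "additionalProperties": False,
    "properties": {"title": {}, "description": {}, "package": {}},
}
entry = {"title": "Kafka Receiver", "description": "...", "stability": "beta"}
print(unexpected_properties(entry, schema))  # ['stability']
```

A full JSON Schema validator also honors keywords such as `patternProperties`, which this sketch deliberately ignores.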

2. The report does not say which V1 entry would be updated

Each entry reads as {"name": "kafkareceiver", "proposed_v1_changes": {...}} but does not include the target V1 file path or whether a matching V1 entry exists. That is the piece a follow-up writer would need to act on, and it is also where most of the design difficulty of the sync lives (matching across naming conventions, handling renames and deprecations, skipping third-party distribution entries).

It could be worth adding a target_v1_file plus v1_entry_exists flag to each entry, so the dry-run becomes more directly actionable.

3. Could reuse InventoryManager from watcher-common / collector-watcher

watcher-common exposes BaseInventoryManager with list_versions, list_release_versions, list_snapshot_versions. collector-watcher.InventoryManager extends that with the distribution axis and load_versioned_inventory(distribution, version), which already does what _find_latest_version + the per-type yaml loop in reader.py does. The same pattern is used by configuration-watcher, java-instrumentation-watcher, and (as a non-watcher consumer) explorer-db-builder.

Picking up list_release_versions(distribution) instead of the local helper would also fix a latent issue: _find_latest_version currently returns the highest semver-sorted directory name including v0.152.1-SNAPSHOT, which means the tool can pick up unreleased data. list_release_versions excludes snapshots.
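
For illustration, the snapshot problem and its fix can be reproduced without either helper. The sketch below is not the InventoryManager implementation: `latest_release` and the directory-name pattern are assumptions inferred from the `v0.152.1-SNAPSHOT` example.

```python
import re
from typing import Iterable, Optional

# Assumed layout: v<major>.<minor>.<patch>, optionally suffixed with -SNAPSHOT.
VERSION_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(-SNAPSHOT)?$")


def latest_release(dir_names: Iterable[str]) -> Optional[str]:
    """Return the highest released version directory, skipping snapshots."""
    releases = []
    for name in dir_names:
        match = VERSION_RE.match(name)
        if match is None or match.group(4):
            continue  # not a version directory, or an unreleased snapshot
        version = tuple(int(part) for part in match.group(1, 2, 3))
        releases.append((version, name))
    # Tuples compare numerically, so v0.151.0 ranks above v0.9.0.
    return max(releases)[1] if releases else None


print(latest_release(["v0.9.0", "v0.151.0", "v0.152.1-SNAPSHOT"]))  # v0.151.0
```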

Smaller notes

  • display_name would overwrite V1 title. From a 30-component sample most divergences are cosmetic (V1 adds a "Collector" infix), but a handful of entries lose real information (e.g. otelarrowexporter: V1 says "OpenTelemetry Protocol with Apache Arrow Exporter", V2 says "OpenTelemetry Arrow Exporter"). Limiting the initial sync to description might be safer.
  • No README.md under v1-registry-sync/ yet; the sibling watcher packages have one.

Rama542 added 2 commits May 14, 2026 11:10
- Remove stability from proposed_v1_changes: the V1 schema declares
  additionalProperties false and has no stability field, so the
  validator would reject any entry containing it
- Remove title/display_name from proposed_v1_changes: a handful of V1
  titles carry more information than the V2 display_name (e.g.
  otelarrowexporter), so limiting the initial sync to description
  avoids losing fidelity
- Add target_v1_file and v1_entry_exists to each report entry so
  the dry-run output is directly actionable
- Replace local _find_latest_version and _parse_component_file with
  InventoryManager from collector-watcher, which is the same pattern
  used by explorer-db-builder and configuration-watcher; this also
  fixes a latent issue where the old helper could pick up SNAPSHOT
  directories since it sorted all version dirs including pre-releases
- Add --v1-registry-dir CLI argument to enable v1_entry_exists checks
  against a local clone of opentelemetry.io/data/registry
- Add README.md to match sibling watcher packages
@lucacavenaghi97
Member

lucacavenaghi97 commented May 14, 2026

Six changes look good as applied!

One follow-up on target_v1_file. The current f"collector-{name}.yml" template builds paths like collector-kafkareceiver.yml, but actual V1 files are named collector-{component_type}-{slug}.yml, so v1_entry_exists returns false for essentially every component even though most do exist in V1.

In my opinion the cleanest fix would be to join on the Go module path, which both registries already carry:

  • V1 stores it explicitly as package.name (e.g. github.com/open-telemetry/opentelemetry-collector-contrib/receiver/kafkareceiver)
  • V2 carries the same value implicitly: github.com/open-telemetry/opentelemetry-collector-contrib/{component_type}/{name}

Building a dict[go_module_path -> v1_file_path] once at startup by reading the package.name field of each V1 file, then looking up the expected path per V2 component would side-step the naming inconsistencies entirely. Across the 249 contrib components in v0.151.0, 244 match this way; 5 are genuinely missing in V1 (azurefunctionsreceiver, googlesecopsexporter, drainprocessor, spanpruningprocessor, datadogconnector, all recent additions). Those misses would be real signals rather than matcher bugs.

Including expected_go_module_path on every row in the report (matched or not) might also help triage when a match fails. Happy to discuss other approaches if you see a simpler path.
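
A rough sketch of that join, assuming the V1 files have already been parsed into dictionaries keyed by filename (the real tool would read the YAML files from disk). The function names and the sample filename are illustrative; only the `package.name` field, the module path template, and the report field names come from this thread.

```python
CONTRIB_PREFIX = "github.com/open-telemetry/opentelemetry-collector-contrib"


def expected_go_module_path(component_type: str, name: str) -> str:
    return f"{CONTRIB_PREFIX}/{component_type}/{name}"


def build_index(v1_entries: dict[str, dict]) -> dict[str, str]:
    """Map package.name -> V1 filename, skipping entries without a package name."""
    index = {}
    for filename, entry in v1_entries.items():
        module_path = (entry.get("package") or {}).get("name")
        if module_path:
            index[module_path] = filename
    return index


def report_row(component_type: str, name: str, index: dict[str, str]) -> dict:
    path = expected_go_module_path(component_type, name)
    target = index.get(path)
    return {
        "name": name,
        "expected_go_module_path": path,  # present whether or not a match exists
        "target_v1_file": target,
        "v1_entry_exists": target is not None,
    }


# Hypothetical parsed V1 registry with a single entry.
v1_entries = {
    "collector-receiver-fooreceiver.yml": {
        "package": {"name": f"{CONTRIB_PREFIX}/receiver/fooreceiver"},
    },
}
index = build_index(v1_entries)
print(report_row("receiver", "fooreceiver", index)["target_v1_file"])
print(report_row("processor", "drainprocessor", index)["v1_entry_exists"])
```

Building the index once keeps each lookup constant-time and leaves filename conventions out of the matching logic entirely.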

One related note: the existing test_target_v1_file_follows_naming_convention test asserts collector-fooreceiver.yml for fooreceiver, so it currently passes against the wrong convention. Worth replacing with a fixture that mirrors a realistic V1 naming if you go this route.

Rama542 and others added 2 commits May 15, 2026 09:12
The previous target_v1_file used f"collector-{name}.yml" but actual V1
files follow collector-{component_type}-{slug}.yml, so v1_entry_exists
was returning false for nearly every component.

The fix builds a dict[go_module_path -> v1_filename] at startup by
reading the package.name field from each V1 file. Each V2 component's
expected module path is constructed as:
  github.com/open-telemetry/opentelemetry-collector-contrib/{type}/{name}

Matching on the module path is consistent across both registries and
avoids naming-convention guesswork. Across 249 contrib components in
v0.151.0, 244 match this way; the 5 that do not (azurefunctionsreceiver,
googlesecopsexporter, drainprocessor, spanpruningprocessor,
datadogconnector) are genuinely missing from V1, not matcher bugs.

expected_go_module_path is also included on every report row so misses
are easy to triage. The test fixture now uses realistic V1 file names
(collector-receiver-fooreceiver.yml) instead of the old wrong convention.
@Rama542
Contributor Author

Rama542 commented May 15, 2026

Good catch, thank you. The naming convention approach was broken because actual V1 files follow collector-{component_type}-{slug}.yml, not collector-{name}.yml, so v1_entry_exists was returning false for almost everything.

Switched to matching on Go module path instead. At startup the tool now reads every .yml file in --v1-registry-dir and builds a {package.name -> filename} index. Each V2 component's expected module path is constructed as github.com/open-telemetry/opentelemetry-collector-contrib/{component_type}/{name} and looked up in the index to find the actual V1 file. expected_go_module_path is included on every row so misses are easy to triage.

I also updated the test fixture to use a realistic V1 filename (collector-receiver-fooreceiver.yml) so the test covers the actual matching logic rather than the old wrong convention.

Member

@lucacavenaghi97 lucacavenaghi97 left a comment


The Go module path matching reads as I'd expect, and all the follow-up points from both rounds are now addressed. Approving.

Worth merging main into the branch before this lands. It's a few commits behind.

@vitorvasc vitorvasc enabled auto-merge May 18, 2026 10:06
@vitorvasc vitorvasc added this pull request to the merge queue May 18, 2026
Merged via the queue into open-telemetry:main with commit 92929f8 May 18, 2026
15 checks passed
@otelbot
Contributor

otelbot Bot commented May 18, 2026

Thank you for your contribution @Rama542! 🎉 We would like to hear from you about your experience contributing to OpenTelemetry by taking a few minutes to fill out this survey.


Development

Successfully merging this pull request may close these issues.

feat: sync stability and description from V2 registry into existing V1 entries

4 participants