Iceberg: read PartitionStatisticsFile so last_updated_at / last_updated_snapshot_id survive expire_snapshots #73399

@eshishki

Enhancement

Problem

For Iceberg tables, last_updated_at and last_updated_snapshot_id per partition are not stored in any manifest — Iceberg derives them on the fly by walking ManifestEntry.snapshot_id() for every live data/delete file in a partition and looking up Snapshot.timestampMillis() from TableMetadata.snapshots().

This works only as long as the snapshots that originally wrote each partition are still in TableMetadata.snapshots(). After expire_snapshots (or expire_older_than) drops those snapshots:

  • Snapshot snap = table.snapshot(entry.snapshotId()) returns null for the expired writers
  • last_updated_at / last_updated_snapshot_id for those partitions become unrecoverable from manifests alone
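The failure mode above can be reproduced with a minimal sketch. The `Snap`, `Entry`, and `LastUpdated` records and the `derive` method are hypothetical stand-ins written for this illustration, not Iceberg or StarRocks classes; the map lookup plays the role of `table.snapshot(entry.snapshotId())`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT the Iceberg classes.
final class LastUpdatedDerivation {
    record Snap(long id, long timestampMillis) {}
    record Entry(String partition, long snapshotId) {}
    // Both fields are null when no writing snapshot could be resolved.
    record LastUpdated(Long at, Long snapshotId) {}

    // Walks live manifest entries and keeps, per partition, the newest resolvable writer.
    static Map<String, LastUpdated> derive(List<Entry> liveEntries, Map<Long, Snap> snapshots) {
        Map<String, LastUpdated> result = new HashMap<>();
        for (Entry e : liveEntries) {
            // Mirrors table.snapshot(entry.snapshotId()): null once the writer is expired.
            Snap snap = snapshots.get(e.snapshotId());
            if (snap == null) {
                // The partition is still live, but its timestamp is unrecoverable.
                result.putIfAbsent(e.partition(), new LastUpdated(null, null));
                continue;
            }
            LastUpdated current = result.get(e.partition());
            if (current == null || current.at() == null || snap.timestampMillis() > current.at()) {
                result.put(e.partition(), new LastUpdated(snap.timestampMillis(), snap.id()));
            }
        }
        return result;
    }
}
```

Removing a snapshot from the `snapshots` map and calling `derive` again leaves the partition present in the result but with null fields, which matches the symptom described here.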

Iceberg introduced PartitionStatisticsFile (referenced from TableMetadata.partition-statistics) precisely to persist these per-partition counters, including last_updated_at and last_updated_snapshot_id, across snapshot expiration. Engines that read it can keep reporting correct values; engines that do not read it lose those values for the affected partitions after the first expire_snapshots.

Impact in StarRocks today

StarRocks currently never reads PartitionStatisticsFile. After expire_snapshots on an Iceberg table:

  1. SELECT * FROM <iceberg_table>$partitions: the last_updated_at and last_updated_snapshot_id columns become NULL (or 0 / -1 depending on path) for any partition whose writing snapshot has been expired. Users querying this metadata table for ops/audit see incomplete data.

  2. MV auto-refresh: IcebergCatalog.getPartitions populates Partition.modifiedTime from the same manifest walk. DefaultTraits.getUpdatedPartitionNames compares modifiedTime against MV.BasePartitionInfo.lastRefreshTime to decide which partitions need refresh. After expire:

    • Partitions with expired writers get modifiedTime = 0 (or a wall-clock time earlier than the last MV refresh) → MV treats them as "not updated" and skips them even if their data actually changed.
    • In the opposite direction, any refresh strategy that hashes manifest contents (e.g. getPartitionVersion) sees dvCount / position_delete_* shift after table maintenance (rewrite_files, DV compaction) and over-refreshes partitions whose logical data is unchanged.
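The under-refresh case can be sketched as follows. This assumes the comparison is a plain "modifiedTime later than lastRefreshTime" check; the real logic in DefaultTraits.getUpdatedPartitionNames may differ, and `RefreshDecisionSketch` with its method is a hypothetical name used only for this illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical simplification of the refresh decision; not the StarRocks implementation.
final class RefreshDecisionSketch {
    // Returns the partitions whose modifiedTime is later than the MV's lastRefreshTime.
    static Set<String> updatedPartitions(Map<String, Long> modifiedTime,
                                         Map<String, Long> lastRefreshTime) {
        Set<String> updated = new TreeSet<>();
        for (Map.Entry<String, Long> e : modifiedTime.entrySet()) {
            Long refreshed = lastRefreshTime.get(e.getKey());
            // A partition the MV has never seen always needs a refresh.
            if (refreshed == null || e.getValue() > refreshed) {
                updated.add(e.getKey());
            }
        }
        return updated;
    }
}
```

With lastRefreshTime = 2000 for both partitions, a partition whose true modification time is 2500 is selected, but the same partition reported as modifiedTime = 0 after its writer was expired is silently skipped.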

Combined effect: on long-lived Iceberg tables with periodic expire_snapshots, the MV refresh decision is wrong in both directions, and $partitions shows wrong audit data.

Proposed solution

Teach the FE Iceberg connector to read PartitionStatisticsFile:

  • $iceberg_partitions_table (BE scanner) — when the target snapshot is reachable from partition-statistics, build the partition row from PartitionStats (which carries persisted last_updated_at / last_updated_snapshot_id) merged with manifests added between the stats snapshot and the target snapshot.
  • IcebergCatalog.getPartitions (FE) — same fast path so MV refresh, predicate listing, and pinned-snapshot reads use the persisted counters. Fingerprint must stay bit-equivalent with the manifest-scan path so cached partition entries from either path remain interchangeable.

Both paths should share one implementation (stats file reader + incremental manifest merge) since they consume the same Iceberg API.
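A rough sketch of the shared merge step, under stated assumptions: the persisted rows come from the stats file, and the files added between the stats snapshot and the target snapshot have already been collected from manifests. `PartitionRow`, `AddedFile`, and `merge` are hypothetical names for illustration; reading the actual stats file through the Iceberg API is not shown.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "stats file + incremental manifest merge"; not the Iceberg API.
final class StatsMergeSketch {
    record PartitionRow(long lastUpdatedAt, long lastUpdatedSnapshotId) {}
    record AddedFile(String partition, long snapshotId, long snapshotTimestampMillis) {}

    // Starts from the persisted rows and overlays files added after the stats snapshot.
    static Map<String, PartitionRow> merge(Map<String, PartitionRow> persisted,
                                           List<AddedFile> addedSinceStats) {
        Map<String, PartitionRow> rows = new HashMap<>(persisted);
        for (AddedFile f : addedSinceStats) {
            PartitionRow candidate = new PartitionRow(f.snapshotTimestampMillis(), f.snapshotId());
            // Keep whichever writer is newer; partitions absent from the stats file are added.
            rows.merge(f.partition(), candidate,
                (old, neu) -> neu.lastUpdatedAt() > old.lastUpdatedAt() ? neu : old);
        }
        return rows;
    }
}
```

Because the persisted rows already carry last_updated_at and last_updated_snapshot_id, partitions untouched since the stats snapshot keep their values even when the original writing snapshot is no longer in the table metadata.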

A kill switch (session variable) lets operators disable the fast path if a stats file turns out to be corrupted or out of sync with the table.

Why now

Without this, the only way to keep $partitions and MV refresh correct on Iceberg tables is to never expire snapshots — which defeats the purpose of expire_snapshots (storage reclamation, metadata size control). Tables that rely on Iceberg's own compute_partition_stats action to populate the stats file already pay the write cost but get no read-side benefit in StarRocks.
