Enhancement
Problem
For Iceberg tables, last_updated_at and last_updated_snapshot_id per partition are not stored in any manifest — Iceberg derives them on the fly by walking ManifestEntry.snapshot_id() for every live data/delete file in a partition and looking up Snapshot.timestampMillis() from TableMetadata.snapshots().
This works only as long as the snapshots that originally wrote each partition are still in TableMetadata.snapshots(). After expire_snapshots (or expire_older_than) drops those snapshots:
Snapshot snap = table.snapshot(entry.snapshotId()) returns null for the expired writers
last_updated_at / last_updated_snapshot_id for those partitions become unrecoverable from manifests alone
Iceberg introduced PartitionStatisticsFile (referenced from TableMetadata.partition-statistics) precisely to persist these per-partition counters — including last_updated_at and last_updated_snapshot_id — across snapshot expiration. Engines that read it can keep reporting correct values; engines that don't, lose data after the first expire_snapshots.
Impact in StarRocks today
StarRocks currently never reads PartitionStatisticsFile. After expire_snapshots on an Iceberg table:
-
SELECT * FROM <iceberg_table>$partitions — last_updated_at and last_updated_snapshot_id columns become NULL (or 0 / -1 depending on path) for any partition whose writing snapshot has been expired. Users querying this metadata table for ops/audit see incomplete data.
-
MV auto-refresh — IcebergCatalog.getPartitions populates Partition.modifiedTime from the same manifest walk. DefaultTraits.getUpdatedPartitionNames compares modifiedTime against MV.BasePartitionInfo.lastRefreshTime to decide which partitions need refresh. After expire:
- Partitions with expired writers get
modifiedTime = 0 (or earlier wall-clock than the MV refresh) → MV treats them as "not updated" and skips them even if their data actually changed.
- In the opposite direction, any refresh strategy that hashes manifest contents (e.g.
getPartitionVersion) sees dvCount / position_delete_* shift after table maintenance (rewrite_files, DV compaction) and over-refreshes partitions whose logical data is unchanged.
Combined effect: on long-lived Iceberg tables with periodic expire_snapshots, the MV refresh decision is wrong in both directions, and $partitions shows wrong audit data.
Proposed solution
Teach the FE Iceberg connector to read PartitionStatisticsFile:
$iceberg_partitions_table (BE scanner) — when the target snapshot is reachable from partition-statistics, build the partition row from PartitionStats (which carries persisted last_updated_at / last_updated_snapshot_id) merged with manifests added between the stats snapshot and the target snapshot.
IcebergCatalog.getPartitions (FE) — same fast path so MV refresh, predicate listing, and pinned-snapshot reads use the persisted counters. Fingerprint must stay bit-equivalent with the manifest-scan path so cached partition entries from either path remain interchangeable.
Both paths should share one implementation (stats file reader + incremental manifest merge) since they consume the same Iceberg API.
A kill switch (session variable) lets operators disable the fast path if a stats file turns out to be corrupted or out of sync with the table.
Why now
Without this, the only way to keep $partitions and MV refresh correct on Iceberg tables is to never expire snapshots — which defeats the purpose of expire_snapshots (storage reclamation, metadata size control). Tables that rely on Iceberg's own compute_partition_stats action to populate the stats file already pay the write cost but get no read-side benefit in StarRocks.
Enhancement
Problem
For Iceberg tables,
last_updated_atandlast_updated_snapshot_idper partition are not stored in any manifest — Iceberg derives them on the fly by walkingManifestEntry.snapshot_id()for every live data/delete file in a partition and looking upSnapshot.timestampMillis()fromTableMetadata.snapshots().This works only as long as the snapshots that originally wrote each partition are still in
TableMetadata.snapshots(). Afterexpire_snapshots(orexpire_older_than) drops those snapshots:Snapshot snap = table.snapshot(entry.snapshotId())returnsnullfor the expired writerslast_updated_at/last_updated_snapshot_idfor those partitions become unrecoverable from manifests aloneIceberg introduced
PartitionStatisticsFile(referenced fromTableMetadata.partition-statistics) precisely to persist these per-partition counters — includinglast_updated_atandlast_updated_snapshot_id— across snapshot expiration. Engines that read it can keep reporting correct values; engines that don't, lose data after the firstexpire_snapshots.Impact in StarRocks today
StarRocks currently never reads
PartitionStatisticsFile. Afterexpire_snapshotson an Iceberg table:SELECT * FROM <iceberg_table>$partitions—last_updated_atandlast_updated_snapshot_idcolumns becomeNULL(or 0 / -1 depending on path) for any partition whose writing snapshot has been expired. Users querying this metadata table for ops/audit see incomplete data.MV auto-refresh —
IcebergCatalog.getPartitionspopulatesPartition.modifiedTimefrom the same manifest walk.DefaultTraits.getUpdatedPartitionNamescomparesmodifiedTimeagainstMV.BasePartitionInfo.lastRefreshTimeto decide which partitions need refresh. After expire:modifiedTime = 0(or earlier wall-clock than the MV refresh) → MV treats them as "not updated" and skips them even if their data actually changed.getPartitionVersion) seesdvCount/position_delete_*shift after table maintenance (rewrite_files, DV compaction) and over-refreshes partitions whose logical data is unchanged.Combined effect: on long-lived Iceberg tables with periodic
expire_snapshots, the MV refresh decision is wrong in both directions, and$partitionsshows wrong audit data.Proposed solution
Teach the FE Iceberg connector to read
PartitionStatisticsFile:$iceberg_partitions_table(BE scanner) — when the target snapshot is reachable frompartition-statistics, build the partition row fromPartitionStats(which carries persistedlast_updated_at/last_updated_snapshot_id) merged with manifests added between the stats snapshot and the target snapshot.IcebergCatalog.getPartitions(FE) — same fast path so MV refresh, predicate listing, and pinned-snapshot reads use the persisted counters. Fingerprint must stay bit-equivalent with the manifest-scan path so cached partition entries from either path remain interchangeable.Both paths should share one implementation (stats file reader + incremental manifest merge) since they consume the same Iceberg API.
A kill switch (session variable) lets operators disable the fast path if a stats file turns out to be corrupted or out of sync with the table.
Why now
Without this, the only way to keep
$partitionsand MV refresh correct on Iceberg tables is to never expire snapshots — which defeats the purpose ofexpire_snapshots(storage reclamation, metadata size control). Tables that rely on Iceberg's owncompute_partition_statsaction to populate the stats file already pay the write cost but get no read-side benefit in StarRocks.