Description:
In version 4.0.3-fbcf73d, we encountered an inconsistency: the SHOW PROC '/statistic' output on the FE leader node differs from the output on the other FE nodes.
Problem Details
- The FE leader node reports two additional abnormal replica entries that do not appear on other nodes.
- Both affected tables are single-replica tables.
- The abnormal tablets show:
  - state = NORMAL
  - One replica includes a LastFailedTime
- These abnormal entries are not automatically cleaned up.
- Manually marking the replicas as bad is ineffective.
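The symptom listed above (state = NORMAL combined with a LastFailedTime) can be checked mechanically once the replica rows have been exported from the FE. The Python sketch below is only an illustration under stated assumptions: the row layout, the column names `TabletId`, `ReplicaId`, `State` and `LastFailedTime`, and the placeholder values treated as "never failed" are hypothetical and may not match the real output schema.

```python
from typing import Dict, List

# Assumed placeholders meaning "this replica never failed" (hypothetical values).
EMPTY_MARKERS = {"", "N/A", "NULL", "-1"}


def find_suspect_replicas(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return rows whose state is NORMAL but that still carry a failure timestamp."""
    suspects = []
    for row in rows:
        state = row.get("State", "").strip().upper()
        failed = row.get("LastFailedTime", "").strip()
        if state == "NORMAL" and failed.upper() not in EMPTY_MARKERS:
            suspects.append(row)
    return suspects


# Hypothetical sample rows; the IDs and timestamp are made up for illustration.
rows = [
    {"TabletId": "10001", "ReplicaId": "20001", "State": "NORMAL", "LastFailedTime": "NULL"},
    {"TabletId": "10002", "ReplicaId": "20002", "State": "NORMAL", "LastFailedTime": "2025-01-01 00:00:00"},
]
suspects = find_suspect_replicas(rows)
print([r["ReplicaId"] for r in suspects])
```

Running the same check against rows collected from the leader and from a follower would show whether the suspicious replicas exist only in the leader's view.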
Impact
When querying the affected table through this FE leader node, the query fails with:
SQL Error [1064] [42000]: Build Exec OlapScanNode fail, scan info is invalid
Observed Behavior
- The invalid replica metadata persists only on the leader node.
- Other nodes do not report these abnormal replicas.
- After restarting the leader node:
  - The stale replicas are identified as redundant replicas
  - They are automatically cleaned up
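To pin down exactly which entries exist only on the leader, the replica lists seen by the leader and by another node can be compared as sets. The following Python sketch assumes each replica has already been reduced to a `(tablet_id, replica_id)` pair by some export step. That pair format, the function name `extra_on_leader`, and the sample IDs are all hypothetical.

```python
from typing import Iterable, List, Tuple

# (tablet_id, replica_id); this key format is an assumption for illustration.
ReplicaKey = Tuple[str, str]


def extra_on_leader(leader: Iterable[ReplicaKey],
                    follower: Iterable[ReplicaKey]) -> List[ReplicaKey]:
    """Return replica keys reported by the leader but missing on the follower."""
    follower_keys = set(follower)
    return sorted(key for key in set(leader) if key not in follower_keys)


# Hypothetical sample data: the leader reports one replica the follower does not.
leader_rows = [("10001", "20001"), ("10002", "20002"), ("10002", "20003")]
follower_rows = [("10001", "20001"), ("10002", "20002")]

ghosts = extra_on_leader(leader_rows, follower_rows)
print(ghosts)
```

Capturing this difference before and after the leader restart would document which entries were later treated as redundant and cleaned up.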
Expected Behavior
- Replica metadata should remain consistent across nodes
- Stale or invalid replicas should be cleaned up automatically, without requiring a leader restart
- SHOW PROC '/statistic' should not retain ghost replica entries
- Manually marking replicas as bad should properly remove invalid replicas
Reproduction Summary
- Single-replica table enters abnormal state
- SHOW PROC '/statistic' on leader node shows extra abnormal replica entries
- Replica state remains NORMAL but contains failure timestamp
- Querying through leader FE fails with invalid OlapScanNode
- Restart leader node
- Redundant replicas are cleaned
Questions / Investigation Request
Please help investigate:
- Why does the leader node retain stale replica metadata while other nodes do not?
- Why are these ghost replicas not cleaned automatically?
- Why does ADMIN SET REPLICA STATUS PROPERTIES("status"="bad") fail to remove them?
- Is there a metadata cache or replay inconsistency issue related to leader state?
- Could this be caused by delayed tablet report reconciliation or replica version state handling?
Version