[SPARK-56898] Rewrite COUNT(DISTINCT IF) to COUNT(DISTINCT) FILTER for Expand reduction by xumingming · Pull Request #55925 · apache/spark

xumingming · 2026-05-17T02:26:44Z

What changes were proposed in this pull request?

Adds a RewriteCountDistinctConditional optimizer rule that canonicalizes:

  COUNT(DISTINCT IF(cond, base, NULL))
  COUNT(DISTINCT CASE WHEN cond THEN base END)

into:

  COUNT(DISTINCT base) FILTER (WHERE cond)

This reduces the number of distinct groups seen by RewriteDistinctAggregates from N (one per unique conditional expression) to 1 (all aggregates share the same base column), collapsing the Expand factor from N× to 1×.
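
To illustrate why the two forms are interchangeable, here is a minimal sketch in plain Python that models the SQL semantics of both aggregates, including NULL handling. It is not Spark code and not the rule's implementation; the helper names (count_distinct_if, count_distinct_filter, is_paid) and the sample rows are hypothetical.

```python
from typing import Callable, Iterable, Mapping, Optional

Row = Mapping[str, object]
# A condition returns True, False, or None (SQL NULL / unknown).
Cond = Callable[[Row], Optional[bool]]


def count_distinct_if(rows: Iterable[Row], cond: Cond, base: str) -> int:
    """Model of COUNT(DISTINCT IF(cond, base, NULL))."""
    # IF takes the NULL branch unless the condition is strictly true.
    projected = (row[base] if cond(row) is True else None for row in rows)
    # COUNT(DISTINCT ...) ignores NULL inputs.
    return len({value for value in projected if value is not None})


def count_distinct_filter(rows: Iterable[Row], cond: Cond, base: str) -> int:
    """Model of COUNT(DISTINCT base) FILTER (WHERE cond)."""
    # FILTER keeps a row only when the condition is strictly true.
    kept = (row[base] for row in rows if cond(row) is True)
    return len({value for value in kept if value is not None})


def is_paid(row: Row) -> Optional[bool]:
    """Hypothetical condition: pay_status = 'paid', NULL if status is NULL."""
    status = row["pay_status"]
    return None if status is None else status == "paid"


# Hypothetical sample data covering duplicates, a NULL base, and a NULL condition.
rows = [
    {"order_id": 1, "pay_status": "paid"},
    {"order_id": 1, "pay_status": "paid"},      # duplicate base value
    {"order_id": 2, "pay_status": "refunded"},  # condition is false
    {"order_id": None, "pay_status": "paid"},   # NULL base value
    {"order_id": 3, "pay_status": None},        # condition is NULL
    {"order_id": 4, "pay_status": "paid"},
]

if_result = count_distinct_if(rows, is_paid, "order_id")
filter_result = count_distinct_filter(rows, is_paid, "order_id")
print(if_result, filter_result)
```

In this model both forms drop a row when the condition is false or NULL, and both ignore NULL base values, so they count the same set of distinct values.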

Gated by spark.sql.optimizer.rewriteCountDistinctConditional.enabled (default: false).

Includes comprehensive unit tests for rewrite patterns and safety boundaries.

Why are the changes needed?

When a query contains many COUNT(DISTINCT IF(cond_i, col, NULL)) expressions over the same base column, RewriteDistinctAggregates treats each unique IF(...) expression as a distinct group. N conditions → N distinct groups → N× Expand amplification. In production workloads with 25–60 such expressions, this produces multi-terabyte shuffles and hour-long runtimes.

SELECT
  user_id,
  COUNT(DISTINCT IF(dt >= '2026-04-16', order_id, NULL)) AS orders_30d,
  COUNT(DISTINCT IF(dt >= '2026-02-15', order_id, NULL)) AS orders_90d,
  COUNT(DISTINCT IF(pay_status = 'paid', order_id, NULL)) AS orders_paid,
  -- ... 50 more expressions
FROM transactions
GROUP BY user_id
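
The effect on the number of distinct groups can be sketched with a toy model. It assumes, as described above, that distinct aggregates are grouped by the expression under DISTINCT; the CountDistinct type and distinct_groups helper are hypothetical and do not mirror Spark's internal classes.

```python
from typing import List, NamedTuple, Optional, Set


class CountDistinct(NamedTuple):
    """Toy stand-in for one COUNT(DISTINCT child) FILTER (WHERE where)."""
    child: str                   # expression under DISTINCT
    where: Optional[str] = None  # optional FILTER predicate


def distinct_groups(aggs: List[CountDistinct]) -> Set[str]:
    """Group aggregates by the expression under DISTINCT (assumed model)."""
    return {agg.child for agg in aggs}


# Conditions taken from the example query above.
conditions = [
    "dt >= '2026-04-16'",
    "dt >= '2026-02-15'",
    "pay_status = 'paid'",
]

# Before the rewrite: each condition is embedded in its own IF(...) child.
before = [CountDistinct(f"IF({c}, order_id, NULL)") for c in conditions]

# After the rewrite: one shared child, with the condition moved to FILTER.
after = [CountDistinct("order_id", where=c) for c in conditions]

groups_before = len(distinct_groups(before))
groups_after = len(distinct_groups(after))
print(f"distinct groups: {groups_before} -> {groups_after}")
```

Under this model the three conditional expressions form three groups before the rewrite and a single group after it, which is the N-to-1 reduction the rule targets.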

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests covering rewrite patterns and safety boundaries.

Was this patch authored or co-authored using generative AI tooling?

No.

xumingming · 2026-05-19T02:06:35Z

@LuciferYang Can you help take a look at this PR?


xumingming force-pushed the count-distinct-filter-rewrite branch 3 times, most recently from cc0a893 to 3bc29c1 Compare May 17, 2026 11:58

xumingming changed the title ~~[SPARK-56898] feat: rewrite COUNT(DISTINCT IF) to COUNT(DISTINCT) FILTER for Expand reduction~~ [SPARK-56898] Rewrite COUNT(DISTINCT IF) to COUNT(DISTINCT) FILTER for Expand reduction May 17, 2026

xumingming force-pushed the count-distinct-filter-rewrite branch from 3bc29c1 to 028cf1b Compare May 19, 2026 09:10

xumingming wants to merge 1 commit into apache:master from xumingming:count-distinct-filter-rewrite
