Skip to content

[SPARK-56935][SQL] Simplify GetArrayItem codegen and consolidate ElementAtUtils into ArrayExpressionUtils#55973

Closed
gengliangwang wants to merge 2 commits into
apache:masterfrom
gengliangwang:SPARK-56935-getarrayitem
Closed

[SPARK-56935][SQL] Simplify GetArrayItem codegen and consolidate ElementAtUtils into ArrayExpressionUtils#55973
gengliangwang wants to merge 2 commits into
apache:masterfrom
gengliangwang:SPARK-56935-getarrayitem

Conversation

@gengliangwang
Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

Two related changes:

  1. Fold ElementAtUtils.resolveArrayIndex into the existing ArrayExpressionUtils.java, and remove ElementAtUtils.java. The per-expression naming chosen in SPARK-56916 didn't match the codebase's category-scoped utility-class convention (ArrayExpressionUtils, BitmapExpressionUtils, ExpressionImplUtils, ...) and there's now a natural home for any future array-expression ANSI helper.

  2. Refactor GetArrayItem's ANSI codegen + eval paths to use a new ArrayExpressionUtils.checkArrayIndex(int length, int index, QueryContext context) helper, mirroring how ElementAt uses resolveArrayIndex. The helper throws invalidArrayIndexError for negative / out-of-bound ANSI indices and returns the validated 0-based position so the caller chains into arr.get(idx, dataType). The non-ANSI branch keeps its inline form because it must return null (not throw) on out-of-bound.

Net effect: the existing per-expression ElementAtUtils.java is removed; the existing ArrayExpressionUtils.java grows two *ArrayIndex helpers used by ElementAt and GetArrayItem codegen + eval.

Why are the changes needed?

Part of SPARK-56908 (umbrella). arr[idx] and element_at(arr, idx) share the same ANSI out-of-bound error shape; collapsing both into one-line helper calls keeps the codegen size small and avoids maintaining two parallel inline forms.

Does this PR introduce any user-facing change?

No. The compiled behavior is identical; only the emitted Java source text changes.

How was this patch tested?

build/sbt "catalyst/testOnly *ComplexTypeSuite *CollectionExpressionsSuite"
build/sbt "sql/testOnly *QueryExecutionAnsiErrorsSuite"

All pass (83/83 catalyst, 21/21 sql).

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.x

…entAtUtils into ArrayExpressionUtils

### What changes were proposed in this pull request?

Two related changes:

1. Fold `ElementAtUtils.resolveArrayIndex` into the existing
   `ArrayExpressionUtils.java`, and remove `ElementAtUtils.java`.
   The per-expression naming chosen in SPARK-56916 didn't match the
   codebase's category-scoped utility-class convention
   (`ArrayExpressionUtils`, `BitmapExpressionUtils`,
   `ExpressionImplUtils`, ...) and there's now a natural home for any
   future array-expression ANSI helper.

2. Refactor `GetArrayItem`'s ANSI codegen + eval paths to use a new
   `ArrayExpressionUtils.checkArrayIndex(int length, int index,
   QueryContext context)` helper, mirroring how `ElementAt` uses
   `resolveArrayIndex`. The helper throws
   `invalidArrayIndexError` for negative / out-of-bound ANSI indices
   and returns the validated 0-based position so the caller chains
   into `arr.get(idx, dataType)`. The non-ANSI branch keeps its inline
   form because it must return `null` (not throw) on out-of-bound.

Net effect: the existing per-expression `ElementAtUtils.java` is
removed; the existing `ArrayExpressionUtils.java` grows two
`*ArrayIndex` helpers used by `ElementAt` and `GetArrayItem` codegen +
eval.

### Why are the changes needed?

Part of SPARK-56908 (umbrella). `arr[idx]` and `element_at(arr, idx)`
share the same ANSI out-of-bound error shape; collapsing both into
one-line helper calls keeps the codegen size small and avoids
maintaining two parallel inline forms.

### Does this PR introduce _any_ user-facing change?

No. The compiled behavior is identical; only the emitted Java source
text changes.

### How was this patch tested?

```
build/sbt "catalyst/testOnly *ComplexTypeSuite *CollectionExpressionsSuite"
build/sbt "sql/testOnly *QueryExecutionAnsiErrorsSuite"
```

All pass (83/83 catalyst, 21/21 sql).

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.x
@gengliangwang gengliangwang requested review from cloud-fan and viirya May 19, 2026 05:09
@gengliangwang
Copy link
Copy Markdown
Member Author

Audited this PR for the same lessons surfaced by @viirya and @cloud-fan on #55938 (and applied to #55934 / #55939):

  1. Are the helpers redundant with an existing Scala API? No — resolveArrayIndex and checkArrayIndex each wrap a specific QueryExecutionErrors throw (invalidElementAtIndexError / invalidIndexOfZeroError / invalidArrayIndexError) around the per-row bounds check. No existing object encapsulates these throws.
  2. Are the eval-path additions redundant? No — master GetArrayItem.nullSafeEval was a multi-line bounds-check + throw body for ANSI mode. The new helper collapses it to one line.

So no changes needed on this PR for that review.

Copy link
Copy Markdown
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean refactor with two parts: (1) fold resolveArrayIndex from the per-expression ElementAtUtils.java into the category-scoped ArrayExpressionUtils.java and delete ElementAtUtils.java; (2) restructure GetArrayItem's ANSI eval/codegen paths to call into a new ArrayExpressionUtils.checkArrayIndex helper, mirroring how ElementAt uses resolveArrayIndex.

Prior state and problem. SPARK-56916 introduced ElementAtUtils.java next to ArrayExpressionUtils.java to host a single ANSI-bounds helper for ElementAt. That per-expression class name diverged from the codebase's category-scoped utility convention (ArrayExpressionUtils, BitmapExpressionUtils, ExpressionImplUtils, ...). Meanwhile GetArrayItem's ANSI bounds check was still inlined in both nullSafeEval and doGenCode, duplicating logic shared with ElementAt.

Design approach. Two helpers, named to reflect their input convention: resolveArrayIndex (1-based -> 0-based + bounds check + zero check; used by element_at) and checkArrayIndex (0-based bounds check only; used by arr[idx]). Both throw the original QueryExecutionErrors.invalid* errors, so user-facing behavior is unchanged. The non-ANSI branch deliberately stays inline -- it must return null rather than throw, which doesn't share with a throw-on-violation helper.

Implementation sketch. GetArrayItem.nullSafeEval and doGenCode are split into if (failOnError) ... else ... arms, matching ElementAt's ANSI/non-ANSI split structure. The ANSI arm calls the helper; the non-ANSI arm keeps its original inline form (returning null on OOB). ElementAt's two callsites are simple renames.

Behavior is preserved on every dimension I checked: same OOB predicate (index < 0 || index >= length), same error class (invalidArrayIndexError(index, length, context)), same in-bounds null/non-null handling, same non-ANSI null fallback. Tests are limited to the existing suites (ComplexTypeSuite, CollectionExpressionsSuite, QueryExecutionAnsiErrorsSuite), which is appropriate for a refactor with claimed-equivalent behavior. Import cleanup of QueryExecutionErrors in complexTypeExtractors.scala is correct -- no other usages remain in that file. No remaining ElementAtUtils references anywhere in sql/ after deletion.

One minor comment nit inline.

…essions/complexTypeExtractors.scala

Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Copy link
Copy Markdown
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is a clean follow-up to SPARK-56916 — pure refactor, behavior-preserving, and the rename/relocation aligns with the existing *ExpressionUtils convention. A couple of small comments below; none are blockers.

Correctness

  • resolveArrayIndex is moved verbatim, error types (invalidElementAtIndexError / invalidIndexOfZeroError) preserved.
  • checkArrayIndex: the predicate index < 0 || index >= length is equivalent to the original index >= numElements || index < 0, and it correctly raises invalidArrayIndexError (i.e. INVALID_ARRAY_INDEX, distinct from element_at's INVALID_ARRAY_INDEX_IN_ELEMENT_AT).
  • GetArrayItem.nullSafeEval / doGenCode ANSI and non-ANSI branches are behaviorally identical to the prior code. The split of the codegen into two top-level branches (instead of one branch with if (failOnError) interleaved) is a readability win and mirrors what SPARK-56916 did for ElementAt.

Minor: discarded return value in nullSafeEval

case _: ArrayType if failOnError =>
  ArrayExpressionUtils.checkArrayIndex(baseValue.numElements(), index, getContextOrNull())
  if (baseValue.isNullAt(index)) null else baseValue.get(index, dataType)

checkArrayIndex is documented as returning the validated 0-based index, but the caller throws it away here — the call reads like a pure side-effect. That's mildly inconsistent with ElementAt, where the return value is actually consumed (1-based → 0-based). Two options, both fine:

  1. Use the returned value: val idx = ArrayExpressionUtils.checkArrayIndex(...); if (baseValue.isNullAt(idx)) null else baseValue.get(idx, dataType) — keeps symmetry with ElementAt and removes any doubt that the variable referenced after the call is the validated one.
  2. Keep as-is and accept that for GetArrayItem the return is identity; the Javadoc already says so.

I have a mild preference for (1).

Minor: codegen final lost

The ANSI codegen used to emit final int $index = (int) $eval2;; the new version emits int $index = ...checkArrayIndex(...). Behavior identical, just noting the final is gone. Not worth changing.

Style nits

  • ArrayExpressionUtils.java could use a clearer section header now that it hosts two unrelated families (ANSI helpers + binary-search comparators). The current // ANSI index helpers used by ... works but a banner-style // === ANSI index helpers === would scale better as more helpers land.
  • Pre-existing: ArrayExpressionUtils is not final and has no private constructor. Out of scope for this PR, but worth a follow-up.

Tests

The three suites listed cover both expressions' ANSI / non-ANSI paths and the error messages, which is the right set for a refactor of this shape. No new tests needed.

LGTM pending the optional nit above.

@gengliangwang
Copy link
Copy Markdown
Member Author

@cloud-fan thanks for the reivew.
@viirya Thanks. I will keep it as it is

gengliangwang added a commit that referenced this pull request May 22, 2026
…entAtUtils into ArrayExpressionUtils

### What changes were proposed in this pull request?

Two related changes:

1. Fold `ElementAtUtils.resolveArrayIndex` into the existing `ArrayExpressionUtils.java`, and remove `ElementAtUtils.java`. The per-expression naming chosen in SPARK-56916 didn't match the codebase's category-scoped utility-class convention (`ArrayExpressionUtils`, `BitmapExpressionUtils`, `ExpressionImplUtils`, ...) and there's now a natural home for any future array-expression ANSI helper.

2. Refactor `GetArrayItem`'s ANSI codegen + eval paths to use a new `ArrayExpressionUtils.checkArrayIndex(int length, int index, QueryContext context)` helper, mirroring how `ElementAt` uses `resolveArrayIndex`. The helper throws `invalidArrayIndexError` for negative / out-of-bound ANSI indices and returns the validated 0-based position so the caller chains into `arr.get(idx, dataType)`. The non-ANSI branch keeps its inline form because it must return `null` (not throw) on out-of-bound.

Net effect: the existing per-expression `ElementAtUtils.java` is removed; the existing `ArrayExpressionUtils.java` grows two `*ArrayIndex` helpers used by `ElementAt` and `GetArrayItem` codegen + eval.

### Why are the changes needed?

Part of SPARK-56908 (umbrella). `arr[idx]` and `element_at(arr, idx)` share the same ANSI out-of-bound error shape; collapsing both into one-line helper calls keeps the codegen size small and avoids maintaining two parallel inline forms.

### Does this PR introduce _any_ user-facing change?

No. The compiled behavior is identical; only the emitted Java source text changes.

### How was this patch tested?

```
build/sbt "catalyst/testOnly *ComplexTypeSuite *CollectionExpressionsSuite"
build/sbt "sql/testOnly *QueryExecutionAnsiErrorsSuite"
```

All pass (83/83 catalyst, 21/21 sql).

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.x

Closes #55973 from gengliangwang/SPARK-56935-getarrayitem.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 12ab2c5)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants