[SPARK-56935][SQL] Simplify GetArrayItem codegen and consolidate ElementAtUtils into ArrayExpressionUtils#55973
Conversation
…entAtUtils into ArrayExpressionUtils ### What changes were proposed in this pull request? Two related changes: 1. Fold `ElementAtUtils.resolveArrayIndex` into the existing `ArrayExpressionUtils.java`, and remove `ElementAtUtils.java`. The per-expression naming chosen in SPARK-56916 didn't match the codebase's category-scoped utility-class convention (`ArrayExpressionUtils`, `BitmapExpressionUtils`, `ExpressionImplUtils`, ...) and there's now a natural home for any future array-expression ANSI helper. 2. Refactor `GetArrayItem`'s ANSI codegen + eval paths to use a new `ArrayExpressionUtils.checkArrayIndex(int length, int index, QueryContext context)` helper, mirroring how `ElementAt` uses `resolveArrayIndex`. The helper throws `invalidArrayIndexError` for negative / out-of-bound ANSI indices and returns the validated 0-based position so the caller chains into `arr.get(idx, dataType)`. The non-ANSI branch keeps its inline form because it must return `null` (not throw) on out-of-bound. Net effect: the existing per-expression `ElementAtUtils.java` is removed; the existing `ArrayExpressionUtils.java` grows two `*ArrayIndex` helpers used by `ElementAt` and `GetArrayItem` codegen + eval. ### Why are the changes needed? Part of SPARK-56908 (umbrella). `arr[idx]` and `element_at(arr, idx)` share the same ANSI out-of-bound error shape; collapsing both into one-line helper calls keeps the codegen size small and avoids maintaining two parallel inline forms. ### Does this PR introduce _any_ user-facing change? No. The compiled behavior is identical; only the emitted Java source text changes. ### How was this patch tested? ``` build/sbt "catalyst/testOnly *ComplexTypeSuite *CollectionExpressionsSuite" build/sbt "sql/testOnly *QueryExecutionAnsiErrorsSuite" ``` All pass (83/83 catalyst, 21/21 sql). ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor 1.x
|
Audited this PR for the same lessons surfaced by @viirya and @cloud-fan on #55938 (and applied to #55934 / #55939):
So no changes needed on this PR for that review. |
cloud-fan
left a comment
There was a problem hiding this comment.
Clean refactor with two parts: (1) fold resolveArrayIndex from the per-expression ElementAtUtils.java into the category-scoped ArrayExpressionUtils.java and delete ElementAtUtils.java; (2) restructure GetArrayItem's ANSI eval/codegen paths to call into a new ArrayExpressionUtils.checkArrayIndex helper, mirroring how ElementAt uses resolveArrayIndex.
Prior state and problem. SPARK-56916 introduced ElementAtUtils.java next to ArrayExpressionUtils.java to host a single ANSI-bounds helper for ElementAt. That per-expression class name diverged from the codebase's category-scoped utility convention (ArrayExpressionUtils, BitmapExpressionUtils, ExpressionImplUtils, ...). Meanwhile GetArrayItem's ANSI bounds check was still inlined in both nullSafeEval and doGenCode, duplicating logic shared with ElementAt.
Design approach. Two helpers, named to reflect their input convention: resolveArrayIndex (1-based -> 0-based + bounds check + zero check; used by element_at) and checkArrayIndex (0-based bounds check only; used by arr[idx]). Both throw the original QueryExecutionErrors.invalid* errors, so user-facing behavior is unchanged. The non-ANSI branch deliberately stays inline -- it must return null rather than throw, which doesn't share with a throw-on-violation helper.
Implementation sketch. GetArrayItem.nullSafeEval and doGenCode are split into if (failOnError) ... else ... arms, matching ElementAt's ANSI/non-ANSI split structure. The ANSI arm calls the helper; the non-ANSI arm keeps its original inline form (returning null on OOB). ElementAt's two callsites are simple renames.
Behavior is preserved on every dimension I checked: same OOB predicate (index < 0 || index >= length), same error class (invalidArrayIndexError(index, length, context)), same in-bounds null/non-null handling, same non-ANSI null fallback. Tests are limited to the existing suites (ComplexTypeSuite, CollectionExpressionsSuite, QueryExecutionAnsiErrorsSuite), which is appropriate for a refactor with claimed-equivalent behavior. Import cleanup of QueryExecutionErrors in complexTypeExtractors.scala is correct -- no other usages remain in that file. No remaining ElementAtUtils references anywhere in sql/ after deletion.
One minor comment nit inline.
…essions/complexTypeExtractors.scala Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
viirya
left a comment
There was a problem hiding this comment.
Overall this is a clean follow-up to SPARK-56916 — pure refactor, behavior-preserving, and the rename/relocation aligns with the existing *ExpressionUtils convention. A couple of small comments below; none are blockers.
Correctness
resolveArrayIndexis moved verbatim, error types (invalidElementAtIndexError/invalidIndexOfZeroError) preserved.checkArrayIndex: the predicateindex < 0 || index >= lengthis equivalent to the originalindex >= numElements || index < 0, and it correctly raisesinvalidArrayIndexError(i.e.INVALID_ARRAY_INDEX, distinct fromelement_at'sINVALID_ARRAY_INDEX_IN_ELEMENT_AT).GetArrayItem.nullSafeEval/doGenCodeANSI and non-ANSI branches are behaviorally identical to the prior code. The split of the codegen into two top-level branches (instead of one branch withif (failOnError)interleaved) is a readability win and mirrors what SPARK-56916 did forElementAt.
Minor: discarded return value in nullSafeEval
case _: ArrayType if failOnError =>
ArrayExpressionUtils.checkArrayIndex(baseValue.numElements(), index, getContextOrNull())
if (baseValue.isNullAt(index)) null else baseValue.get(index, dataType)checkArrayIndex is documented as returning the validated 0-based index, but the caller throws it away here — the call reads like a pure side-effect. That's mildly inconsistent with ElementAt, where the return value is actually consumed (1-based → 0-based). Two options, both fine:
- Use the returned value:
val idx = ArrayExpressionUtils.checkArrayIndex(...); if (baseValue.isNullAt(idx)) null else baseValue.get(idx, dataType)— keeps symmetry withElementAtand removes any doubt that the variable referenced after the call is the validated one. - Keep as-is and accept that for
GetArrayItemthe return is identity; the Javadoc already says so.
I have a mild preference for (1).
Minor: codegen final lost
The ANSI codegen used to emit final int $index = (int) $eval2;; the new version emits int $index = ...checkArrayIndex(...). Behavior identical, just noting the final is gone. Not worth changing.
Style nits
ArrayExpressionUtils.javacould use a clearer section header now that it hosts two unrelated families (ANSI helpers + binary-search comparators). The current// ANSI index helpers used by ...works but a banner-style// === ANSI index helpers ===would scale better as more helpers land.- Pre-existing:
ArrayExpressionUtilsis notfinaland has no private constructor. Out of scope for this PR, but worth a follow-up.
Tests
The three suites listed cover both expressions' ANSI / non-ANSI paths and the error messages, which is the right set for a refactor of this shape. No new tests needed.
LGTM pending the optional nit above.
|
@cloud-fan thanks for the reivew. |
…entAtUtils into ArrayExpressionUtils ### What changes were proposed in this pull request? Two related changes: 1. Fold `ElementAtUtils.resolveArrayIndex` into the existing `ArrayExpressionUtils.java`, and remove `ElementAtUtils.java`. The per-expression naming chosen in SPARK-56916 didn't match the codebase's category-scoped utility-class convention (`ArrayExpressionUtils`, `BitmapExpressionUtils`, `ExpressionImplUtils`, ...) and there's now a natural home for any future array-expression ANSI helper. 2. Refactor `GetArrayItem`'s ANSI codegen + eval paths to use a new `ArrayExpressionUtils.checkArrayIndex(int length, int index, QueryContext context)` helper, mirroring how `ElementAt` uses `resolveArrayIndex`. The helper throws `invalidArrayIndexError` for negative / out-of-bound ANSI indices and returns the validated 0-based position so the caller chains into `arr.get(idx, dataType)`. The non-ANSI branch keeps its inline form because it must return `null` (not throw) on out-of-bound. Net effect: the existing per-expression `ElementAtUtils.java` is removed; the existing `ArrayExpressionUtils.java` grows two `*ArrayIndex` helpers used by `ElementAt` and `GetArrayItem` codegen + eval. ### Why are the changes needed? Part of SPARK-56908 (umbrella). `arr[idx]` and `element_at(arr, idx)` share the same ANSI out-of-bound error shape; collapsing both into one-line helper calls keeps the codegen size small and avoids maintaining two parallel inline forms. ### Does this PR introduce _any_ user-facing change? No. The compiled behavior is identical; only the emitted Java source text changes. ### How was this patch tested? ``` build/sbt "catalyst/testOnly *ComplexTypeSuite *CollectionExpressionsSuite" build/sbt "sql/testOnly *QueryExecutionAnsiErrorsSuite" ``` All pass (83/83 catalyst, 21/21 sql). ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor 1.x Closes #55973 from gengliangwang/SPARK-56935-getarrayitem. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 12ab2c5) Signed-off-by: Gengliang Wang <gengliang@apache.org>
What changes were proposed in this pull request?
Two related changes:
Fold
ElementAtUtils.resolveArrayIndexinto the existingArrayExpressionUtils.java, and removeElementAtUtils.java. The per-expression naming chosen in SPARK-56916 didn't match the codebase's category-scoped utility-class convention (ArrayExpressionUtils,BitmapExpressionUtils,ExpressionImplUtils, ...) and there's now a natural home for any future array-expression ANSI helper.Refactor
GetArrayItem's ANSI codegen + eval paths to use a newArrayExpressionUtils.checkArrayIndex(int length, int index, QueryContext context)helper, mirroring howElementAtusesresolveArrayIndex. The helper throwsinvalidArrayIndexErrorfor negative / out-of-bound ANSI indices and returns the validated 0-based position so the caller chains intoarr.get(idx, dataType). The non-ANSI branch keeps its inline form because it must returnnull(not throw) on out-of-bound.Net effect: the existing per-expression
ElementAtUtils.javais removed; the existingArrayExpressionUtils.javagrows two*ArrayIndexhelpers used byElementAtandGetArrayItemcodegen + eval.Why are the changes needed?
Part of SPARK-56908 (umbrella).
arr[idx]andelement_at(arr, idx)share the same ANSI out-of-bound error shape; collapsing both into one-line helper calls keeps the codegen size small and avoids maintaining two parallel inline forms.Does this PR introduce any user-facing change?
No. The compiled behavior is identical; only the emitted Java source text changes.
How was this patch tested?
All pass (83/83 catalyst, 21/21 sql).
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor 1.x