[SPARK-56909][SQL] Simplify Cast to int/long codegen under ANSI mode#55934
[SPARK-56909][SQL] Simplify Cast to int/long codegen under ANSI mode#55934gengliangwang wants to merge 7 commits into
Conversation
Stack overview (SPARK-56908 umbrella)This PR is part of a stack of 8 PRs against SPARK-56908. Order:
PRs 1-4 are linearly stacked on each other (each branch is based on the previous one). PR 5 (decimal arithmetic) is stacked on top of PR 3 (cast decimal) since it uses |
e10ad7a to
8eba972
Compare
### What changes were proposed in this pull request? Introduce `CastUtils.java` and use it from `Cast.scala` to collapse the multi-line ANSI overflow-check codegen for casts that target `int` and `long` into one-line static-method calls. Source and target `DataType` constants used in the overflow error message live as `private static final` fields on the helper class, so the happy path performs no per-row `references[]` lookups. Helpers added: * `longToIntExact(long)` for narrowing `long -> int`. * `floatToIntExact(float)`, `doubleToIntExact(double)` for fractional -> int. * `floatToLongExact(float)`, `doubleToLongExact(double)` for fractional -> long. `Cast.scala` changes: * `castIntegralTypeToIntegralTypeExactCode` and `castFractionToIntegralTypeCode` dispatch on the target type: `int` (and `long` for the fraction case) emit a `CastUtils.<...>Exact` call; byte/short targets keep the inline body (refactored in SPARK-56910). * Eval paths for `castToInt` add ANSI `LongType` / `FloatType` / `DoubleType` cases, and `castToLong` adds `FloatType` / `DoubleType` cases, both delegating to the new helpers. ### Why are the changes needed? Part of SPARK-56908. The current ANSI cast codegen emits 5-line inline overflow blocks per call site. Multiplied across the many cast paths in a TPC-DS plan, this contributes meaningfully to the generated source size and to Janino compile time, and pushes whole-stage methods closer to the 64KB JVM method limit. ### Does this PR introduce _any_ user-facing change? No. The compiled behavior is identical; only the emitted Java source text changes. ### How was this patch tested? `build/sbt "catalyst/testOnly *CastSuite *CastWithAnsiOnSuite *CastWithAnsiOffSuite *AnsiCastSuite *TryCastSuite *ExpressionClassIdentitySuite"` — 312/312 pass. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor 1.x
8eba972 to
7209218
Compare
…Numeric/DoubleExactNumeric directly, remove CastUtils.java
…bleExactNumeric in codegen
cloud-fan
left a comment
There was a problem hiding this comment.
Refactor cleanly delegates four ANSI narrow-cast codegen paths (Long→Int and Float/Double→Int/Long) to the existing LongExactNumeric/FloatExactNumeric/DoubleExactNumeric objects in numerics.scala, replacing 5-line inline bodies per call site with a single static call. The getClass.getCanonicalName.stripSuffix("$") pattern matches the well-established convention already used in Cast.scala for IntervalUtils, MathUtils, UTF8StringUtils, etc.
Behavior preserved — the helpers implement the same bounds check (x == x.toInt for Long→Int; Math.floor/ceil for Float/Double→Int/Long) and throw the same castingCauseOverflowError with matching DataType arguments. Deferring byte/short narrowing to SPARK-56910 is sensible since ByteExactNumeric/ShortExactNumeric don't yet expose narrowing methods; the interim if/else inside each helper collapses once that lands. Good call dropping the v1 CastUtils.java in favor of reusing existing infrastructure.
LGTM with one trivial nit inline.
viirya
left a comment
There was a problem hiding this comment.
LGTM. The rewrite is behavior-preserving and I verified the equivalence point-by-point against numerics.scala:
LongExactNumeric.toInt(numerics.scala L112-L118) doesif (x == x.toInt) x.toInt else throw castingCauseOverflowError(x, LongType, IntegerType)— same as the original inline body, and the call site atCast.scalaL2138 is gated bycase LongType, so the hard-codedLongType/IntegerTypein the error match the runtimefrom/to.FloatExactNumeric.toInt|toLongandDoubleExactNumeric.toInt|toLong(numerics.scala L127-L174) implement the sameMath.floor(x) <= upper && Math.ceil(x) >= lowercheck and throw with the same(value, FROM, TO)triple; the call sites at L2140/L2176 are gated bycase FloatType | DoubleType.- The
getClass.getCanonicalName.stripSuffix("$")pattern matches ~17 other usages in this file (IntervalUtils,UTF8StringUtils, etc.) and is already exercised in production (e.g.DateTimeUtilscallsDoubleExactNumeric.toLongdirectly).private[sql] objectdoesn't block Java access — Scala emits the bytecode as public with static forwarders on the companion class.
Two small readability nits, non-blocking:
-
In
castFractionToIntegralTypeCode, thefrom match { case FloatType => ...; case DoubleType => ... }has no default arm. It's safe today because the only callers (L2140, L2176) already gate onFloatType | DoubleType, but it creates an implicit invariant that would silently turn into aMatchErrorif a new fractional type were ever added and routed here. Consider an explicitcase _ => throw SparkException.internalError(...)or a preconditionassert. -
castIntegralTypeToIntegralTypeExactCodeis named as a generic integral-narrowing helper, but theintbranch now hard-codesLongExactNumeric. That's correct because L2138 (case LongType) is the only path into that branch, but the asymmetry with the byte/short branches (still generic) and with the fraction helper (which doesfrom match) is a bit jarring. Either a one-line doc note ("the int branch is only reachable for from=LongType") or rewriting it asfrom match { case LongType => LongExactNumeric }for symmetry would read more cleanly.
Neither needs to block this PR.
…essions/Cast.scala Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
|
@viirya thanks for the careful review and the equivalence verification. Pushed
(Plus
|
|
@cloud-fan @viirya thanks for the review. |
### What changes were proposed in this pull request?
In `Cast.scala`, the ANSI codegen for narrowing casts to `int` / `long` previously emitted a 5-line inline body per call site (bounds check + cast + throw). After this PR it emits a single static call into the existing `LongExactNumeric` / `FloatExactNumeric` / `DoubleExactNumeric` objects in `numerics.scala`, which already implement the same overflow check + `castingCauseOverflowError` throw that this codegen needs.
The rewrite uses the same `getClass.getCanonicalName.stripSuffix("$")` pattern as the adjacent `MathUtils` / `IntervalMathUtils` calls. The Scala compiler emits `public static` forwarders on the companion class of top-level objects, so generated Java code can call e.g. `org.apache.spark.sql.types.LongExactNumeric.toInt(v)` directly.
Touched `Cast.scala` helpers:
* `castIntegralTypeToIntegralTypeExactCode`: the `int` target branch now emits `LongExactNumeric.toInt($c)` (byte/short narrowing stays inline; refactored in SPARK-56910).
* `castFractionToIntegralTypeCode`: the `int` / `long` target branches now emit `FloatExactNumeric` / `DoubleExactNumeric` `toInt` / `toLong` (byte/short narrowing stays inline; refactored in SPARK-56910).
Primitive widening branches and the non-ANSI paths are untouched.
### Why are the changes needed?
Part of SPARK-56908 (umbrella). The narrow-cast ANSI branches in `Cast.doGenCode` are some of the longer inline bodies still emitted per call site. Multiplied across the many cast paths in a TPC-DS plan, they contribute meaningfully to the generated source size and Janino compile time, and push whole-stage methods closer to the 64KB JVM method limit.
Compared to v1 of this PR (which added a new `CastUtils.java` with `longToIntExact` / `floatToIntExact` / etc.), this version calls the existing `LongExactNumeric.toInt` / `FloatExactNumeric.toInt` / `toLong` / `DoubleExactNumeric.toInt` / `toLong` directly. Those are public static forwarders on top-level Scala objects that already implement the same `castingCauseOverflowError(v, FROM, TO)` throw — no new helper class needed. (Applying the same lesson cloud-fan called out on #55938.)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
```
build/sbt "catalyst/testOnly *CastSuite *CastWithAnsiOnSuite \
*CastWithAnsiOffSuite *AnsiCastSuite *TryCastSuite *ExpressionClassIdentitySuite"
```
307/307 pass.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor 1.x
Closes #55934 from gengliangwang/SPARK-56909-cast-int-long.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 9e13442)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
What changes were proposed in this pull request?
In
Cast.scala, the ANSI codegen for narrowing casts toint/longpreviously emitted a 5-line inline body per call site (bounds check + cast + throw). After this PR it emits a single static call into the existingLongExactNumeric/FloatExactNumeric/DoubleExactNumericobjects innumerics.scala, which already implement the same overflow check +castingCauseOverflowErrorthrow that this codegen needs.The rewrite uses the same
getClass.getCanonicalName.stripSuffix("$")pattern as the adjacentMathUtils/IntervalMathUtilscalls. The Scala compiler emitspublic staticforwarders on the companion class of top-level objects, so generated Java code can call e.g.org.apache.spark.sql.types.LongExactNumeric.toInt(v)directly.Touched
Cast.scalahelpers:castIntegralTypeToIntegralTypeExactCode: theinttarget branch now emitsLongExactNumeric.toInt($c)(byte/short narrowing stays inline; refactored in SPARK-56910).castFractionToIntegralTypeCode: theint/longtarget branches now emitFloatExactNumeric/DoubleExactNumerictoInt/toLong(byte/short narrowing stays inline; refactored in SPARK-56910).Primitive widening branches and the non-ANSI paths are untouched.
Why are the changes needed?
Part of SPARK-56908 (umbrella). The narrow-cast ANSI branches in
Cast.doGenCodeare some of the longer inline bodies still emitted per call site. Multiplied across the many cast paths in a TPC-DS plan, they contribute meaningfully to the generated source size and Janino compile time, and push whole-stage methods closer to the 64KB JVM method limit.Compared to v1 of this PR (which added a new
CastUtils.javawithlongToIntExact/floatToIntExact/ etc.), this version calls the existingLongExactNumeric.toInt/FloatExactNumeric.toInt/toLong/DoubleExactNumeric.toInt/toLongdirectly. Those are public static forwarders on top-level Scala objects that already implement the samecastingCauseOverflowError(v, FROM, TO)throw — no new helper class needed. (Applying the same lesson cloud-fan called out on #55938.)Does this PR introduce any user-facing change?
No.
How was this patch tested?
307/307 pass.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor 1.x