[SPARK-56907][SQL] Reduce per-value allocation in DELTA_LENGTH_BYTE_ARRAY Parquet vectorized reader #55932
Open
iemejia wants to merge 1 commit into
What changes were proposed in this pull request?
This PR reduces object allocation in the DELTA_LENGTH_BYTE_ARRAY vectorized Parquet reader (`VectorizedDeltaLengthByteArrayReader`) by applying three targeted changes:

**readBinary**: Replace per-value `in.slice(length)` (one `ByteBuffer` allocation per value) with a single bulk `in.slice(totalDataLen)` that reads the entire batch at once. Individual values are then written to the column vector via `putByteArray` from the shared backing array, eliminating N-1 `ByteBuffer` object allocations.

**skipBinary**: Replace the per-value skip loop (N separate `in.skip()` calls) with a single bulk skip by summing all value lengths upfront.

**readGeoData**: Remove the `ByteBuffer.wrap()` + `ByteBufferOutputWriter` indirection per value and call `putByteArray` directly from the converter output array.

Why are the changes needed?
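For illustration, the difference between per-value slicing and the bulk approach can be sketched in plain Java. This is a minimal sketch, not the code in this PR: it does not use Spark or Parquet classes, a heap-backed `ByteBuffer` stands in for the reader's input stream, a list of strings stands in for the column vector, and the names `BulkSliceSketch`, `totalLength`, `readPerValue`, `readBulk`, and `skipBulk` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the reader logic; not Spark's actual class.
public class BulkSliceSketch {

  // Sum of all value lengths; addExact guards against int overflow.
  public static int totalLength(int[] lengths) {
    int total = 0;
    for (int len : lengths) {
      total = Math.addExact(total, len);
    }
    return total;
  }

  // Per-value pattern: one ByteBuffer view object is created for each value.
  public static List<String> readPerValue(ByteBuffer in, int[] lengths) {
    List<String> out = new ArrayList<>();
    for (int len : lengths) {
      ByteBuffer view = in.slice(); // new view object per value
      out.add(new String(view.array(), view.arrayOffset(), len, StandardCharsets.UTF_8));
      in.position(in.position() + len); // consume this value
    }
    return out;
  }

  // Bulk pattern: one view for the whole batch, then walk offsets in the shared array.
  public static List<String> readBulk(ByteBuffer in, int[] lengths) {
    List<String> out = new ArrayList<>();
    int total = totalLength(lengths);
    ByteBuffer view = in.slice();       // single view object for the batch
    in.position(in.position() + total); // consume the batch in one step
    byte[] array = view.array();        // assumes a heap-backed buffer
    int offset = view.arrayOffset();
    for (int len : lengths) {
      out.add(new String(array, offset, len, StandardCharsets.UTF_8));
      offset += len;
    }
    return out;
  }

  // Bulk skip: advance once by the summed length instead of once per value.
  public static void skipBulk(ByteBuffer in, int[] lengths) {
    in.position(in.position() + totalLength(lengths));
  }

  public static void main(String[] args) {
    ByteBuffer data = ByteBuffer.wrap("abcdef".getBytes(StandardCharsets.UTF_8));
    int[] lengths = {1, 2, 0, 3}; // decoded value lengths, including an empty value
    System.out.println(readPerValue(data.duplicate(), lengths));
    System.out.println(readBulk(data.duplicate(), lengths));
    ByteBuffer skipped = data.duplicate();
    skipBulk(skipped, new int[] {1, 2});
    System.out.println(skipped.remaining());
  }
}
```

Both read methods produce the same values; the bulk variant only changes how many view objects and position updates are needed. Note that `array()` works only for heap-backed buffers, so a real reader handling direct buffers would need a different copy path.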
The DELTA_LENGTH_BYTE_ARRAY encoding is used for binary/string columns in Parquet v2 pages. In the current vectorized reader, `readBinary` allocates one `ByteBuffer` per value via `in.slice(length)`, and `skipBinary` performs a separate stream skip per value. For large batches (e.g. 1M values per page), this creates significant allocation pressure and per-call overhead.

Micro-benchmarks on `VectorizedDeltaReaderBenchmark` Group D show that the `readBinary` speedup is larger for small payloads, where allocation cost dominates, and that `skipBinary` shows a consistent 1.4x improvement across all payload sizes.

Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests: `ParquetDeltaLengthByteArrayEncodingSuite` (14 tests including serialization, random strings, empty strings, skip interleaving, and geo types) and `ParquetEncodingSuite` all pass.

Benchmarks: `VectorizedDeltaReaderBenchmark` Group D (DELTA_LENGTH_BYTE_ARRAY) run locally on JDK 17.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: OpenCode with Claude claude-opus-4.6