Skip to content

[SPARK-56897][SQL] Reduce per-value allocations in DELTA_BYTE_ARRAY Parquet decoder#55924

Open
iemejia wants to merge 1 commit into
apache:masterfrom
iemejia:SPARK-delta-byte-array
Open

[SPARK-56897][SQL] Reduce per-value allocations in DELTA_BYTE_ARRAY Parquet decoder#55924
iemejia wants to merge 1 commit into
apache:masterfrom
iemejia:SPARK-delta-byte-array

Conversation

@iemejia
Copy link
Copy Markdown
Member

@iemejia iemejia commented May 17, 2026

What changes were proposed in this pull request?

Reduces per-value heap allocations in VectorizedDeltaByteArrayReader (the Parquet DELTA_BYTE_ARRAY vectorized decoder) by replacing ByteBuffer-based state tracking with a reusable byte[] buffer.

Key changes:

  1. Replace ByteBuffer previous with byte[] prevBuf + int prevLen: The DELTA_BYTE_ARRAY encoding exploits prefix sharing between consecutive values. The old code materialized each decoded value as a ByteBuffer obtained from arrayData.getByteBuffer() (which allocates ~48 bytes per value via ByteBuffer.wrap()). The new approach uses a reusable byte[] buffer where the prefix bytes from the previous iteration are already in place at the start -- only the suffix portion needs to be written.

  2. Add getSuffixInto() to VectorizedDeltaLengthByteArrayReader: New method that reads suffix bytes directly into a caller-supplied byte[] via InputStream.read(), avoiding the ByteBuffer allocation that getBytes() / in.slice() performs. Also adds getSuffixLength() for cases where only the length is needed.

  3. Rewrite skipBinary() to use prevBuf directly: The old implementation was particularly wasteful -- it maintained two WritableColumnVector instances (binaryValVector and tempBinaryValVector), reset one per skipped value, and swapped them after each iteration. The new implementation simply assembles into prevBuf without touching any column vectors.

  4. Remove tempBinaryValVector field: No longer needed after the skipBinary rewrite.

Why are the changes needed?

The DELTA_BYTE_ARRAY decoder allocated 2 ByteBuffer objects per decoded value (one from in.slice() for the suffix, one from arrayData.getByteBuffer() for the previous-value reference). For a typical 4096-value page, this produced ~8K short-lived objects (~384KB) that stress the young-generation GC.

The skipBinary path was even worse: it reset and rebuilt column vectors per skipped value, involving reserve(), Arrays.copyOf(), and putArray() overhead just to maintain the previous-value state.

After this change, the hot path performs zero per-value heap allocations (the prevBuf array is reused across all values and only grows if a value exceeds its current capacity).

Normalized benchmark improvements (using Variant reads as cross-hardware baseline):

  • readBinary: 1.1-1.3x faster across overlap shapes
  • skipBinary: 1.5-1.9x faster (largest gains from eliminating column vector reset/swap)
  • readBinary(len) single-value: ~1.2x faster

Does this PR introduce any user-facing change?

No.

How was this patch tested?

  • Existing tests: ParquetDeltaByteArrayEncodingSuite (14 tests), ParquetDeltaLengthByteArrayEncodingSuite, ParquetEncodingSuite -- all pass.
  • Benchmark: VectorizedDeltaReaderBenchmark Group C (DELTA_BYTE_ARRAY) with four overlap shapes (no overlap, half overlap, full overlap, len=64).

Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenCode with Claude Opus (claude-opus-4.6)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant