[SPARK-56637][CORE] Fix Variant getFieldByKey to handle unsorted object fields by yadavay-amzn · Pull Request #55928 · apache/spark

yadavay-amzn · 2026-05-17T05:51:44Z

What changes were proposed in this pull request?

getFieldByKey() uses binary search for objects with >=32 fields, assuming field IDs are sorted alphabetically by key name. The Variant format spec allows unsorted objects (indicated by bit 4 of the object header). External producers (Parquet, Iceberg) may produce unsorted variants, causing binary search to silently return null for keys that exist.

Fix: check the object header sort bit before choosing binary search vs linear scan. Fall back to linear scan when fields are unsorted.
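The strategy described above can be sketched in a few lines. This is a toy model under stated assumptions, not the code in the patch: `SortAwareLookup`, `findKey`, and `makeKeys` are hypothetical names, keys live in a plain `String[]` instead of being decoded from the variant binary, and the `sorted` flag stands in for the header sort bit.

```java
import java.util.Arrays;

/** Illustrative only: a toy model of the lookup strategy, not Spark's implementation. */
final class SortAwareLookup {
    // Same threshold value as the constant shown in the patch.
    static final int BINARY_SEARCH_THRESHOLD = 32;

    /** Returns the index of key in keys, or -1 if it is absent. */
    static int findKey(String[] keys, boolean sorted, String key) {
        // Binary search is only valid when the producer declared the keys sorted.
        if (sorted && keys.length >= BINARY_SEARCH_THRESHOLD) {
            int idx = Arrays.binarySearch(keys, key);
            return idx >= 0 ? idx : -1;
        }
        // Small or unsorted objects: fall back to a linear scan.
        for (int i = 0; i < keys.length; i++) {
            if (keys[i].equals(key)) {
                return i;
            }
        }
        return -1;
    }

    /** Builds n zero-padded keys ("k00", "k01", ...), optionally in descending order. */
    static String[] makeKeys(int n, boolean descending) {
        String[] keys = new String[n];
        for (int i = 0; i < n; i++) {
            int id = descending ? n - 1 - i : i;
            keys[i] = String.format("k%02d", id);
        }
        return keys;
    }

    public static void main(String[] args) {
        String[] reversed = makeKeys(32, true);
        // Unsorted object: the linear scan still finds the key.
        System.out.println(findKey(reversed, false, "k05")); // 26
    }
}
```

The condition `sorted && keys.length >= BINARY_SEARCH_THRESHOLD` is the only place the sort flag is consulted, so sorted objects keep their existing fast path.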

Why are the changes needed?

Data correctness bug -- getFieldByKey silently returns null for fields that exist in unsorted variant objects. This affects any variant data produced by external systems that do not sort field IDs.

Does this PR introduce any user-facing change?

Yes -- queries on variant columns with unsorted objects will now correctly return field values instead of null.

How was this patch tested?

Added test in VariantExpressionSuite that constructs a 32-field unsorted variant object (sort bit=0, field IDs in reverse order) and verifies getFieldByKey finds keys correctly. Test fails without the fix (binary search returns null), passes with it.
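To illustrate the failure mode this test exercises, here is a small standalone sketch. It is not the test from VariantExpressionSuite: `UnsortedBinarySearchDemo` and its methods are hypothetical, and the keys are plain strings in reverse order instead of an encoded variant object.

```java
/** Illustrative only: why binary search misses keys in a reverse-ordered key list. */
final class UnsortedBinarySearchDemo {
    /** Textbook binary search that assumes ascending order; returns -1 when not found. */
    static int binarySearch(String[] keys, String key) {
        int low = 0;
        int high = keys.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            int cmp = keys[mid].compareTo(key);
            if (cmp < 0) {
                low = mid + 1;
            } else if (cmp > 0) {
                high = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }

    /** Builds n zero-padded keys in descending order: "k31", "k30", ..., "k00" for n = 32. */
    static String[] reversedKeys(int n) {
        String[] keys = new String[n];
        for (int i = 0; i < n; i++) {
            keys[i] = String.format("k%02d", n - 1 - i);
        }
        return keys;
    }

    /** Counts keys that are present in the array but that binarySearch fails to find. */
    static int countMisses(String[] keys) {
        int misses = 0;
        for (String key : keys) {
            if (binarySearch(keys, key) < 0) {
                misses++;
            }
        }
        return misses;
    }

    public static void main(String[] args) {
        String[] keys = reversedKeys(32);
        System.out.println(binarySearch(keys, "k05")); // -1, although the key is present
        System.out.println(countMisses(keys));         // 31
    }
}
```

With strictly descending keys, every comparison sends the search toward the half that cannot contain the target, so only the key at the first midpoint is ever found.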

Was this patch authored or co-authored using generative AI tooling?

Yes.

steveloughran · 2026-05-18T11:48:53Z

The iceberg impl. sorts all input, irrespective of size, and then binary searches.

I'm wondering if that's a better approach.

Here every lookup on a large variant will, on average, take n/2 comparisons. For k lookups, the total cost of the operation is k * n / 2.

Java's timsort averages O(n * lg(n)), after which each binary search takes lg(n) operations. For k lookups, the total cost is k * lg(n) + n * lg(n).

For sort + binary search to be the better choice, we need

k * n / 2 > (k + n) * lg(n)

Unless I've got my numbers wrong, you're going to need a large value of k before it's worth sorting.

Special case: all values get looked up, so k = n. The threshold is

n * n / 2 > n * 2 * lg(n)
n / 2 = 2 * lg(n)
n = 4 * lg(n)
n / lg(n) = 4

which is satisfied at n = 16, since lg(16) = 4.

So we could say "sort if length > 16", but again, that's assuming every value is resolved, which we don't know in advance.

If only a few values are looked up, a plain linear scan, as here, is the right thing to do.

To conclude: I think this simple check is the right strategy, unless and until more data is collected on what variant structures actually get used in production systems.
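The cost comparison above can be checked numerically. The sketch below plugs the two cost models from this comment into Java and searches for the smallest k at which sorting first becomes strictly cheaper. `SortBreakEven` and its method names are hypothetical, and the models count idealized operations only, not measured time.

```java
/** Illustrative only: evaluates the two cost models from the comment above. */
final class SortBreakEven {
    static double lg(double n) {
        return Math.log(n) / Math.log(2);
    }

    /** k linear scans over n fields, n/2 comparisons each on average. */
    static double linearCost(int n, int k) {
        return (double) k * n / 2.0;
    }

    /** One sort at n * lg(n), then k binary searches at lg(n) each. */
    static double sortThenSearchCost(int n, int k) {
        return (k + n) * lg(n);
    }

    /** Smallest k for which sorting first is strictly cheaper, or -1 if none is found. */
    static int breakEvenLookups(int n) {
        for (int k = 1; k <= 64 * n; k++) {
            if (linearCost(n, k) > sortThenSearchCost(n, k)) {
                return k;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(breakEvenLookups(32));   // 15
        System.out.println(breakEvenLookups(1024)); // 21
    }
}
```

Under these models a 32-field object needs about 15 lookups before sorting pays for itself, which is consistent with the conclusion that a linear scan is reasonable when only a few fields are read.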

steveloughran · 2026-05-18T11:50:48Z

      final int BINARY_SEARCH_THRESHOLD = 32;
-      if (size < BINARY_SEARCH_THRESHOLD) {
+      int typeInfo = (value[pos] >> BASIC_TYPE_BITS) & TYPE_INFO_MASK;
+      boolean sorted = ((typeInfo >> 5) & 0x1) != 0;


You could do (typeInfo & 0x10000) != 0 and keep the ALU barrel shifter idle. Or even do the same on L52.

yadavay-amzn · 2026-05-18

Applied the bitmask approach using (typeInfo & 0x20) != 0, since typeInfo is 6 bits wide (masked by TYPE_INFO_MASK = 0x3F) and the sort bit is at position 5 within it. 0x20 = 1 << 5 = 32.

I think 0x10000 (65536) might be a typo since it exceeds the 6-bit range? Please let me know if I'm reading this wrong.
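The equivalence of the two forms, and the reason 0x10000 cannot work here, can be checked exhaustively over the 6-bit range. This is a standalone sketch: `SortBitMask` and its method names are hypothetical, and only the expressions and the TYPE_INFO_MASK value are taken from this thread.

```java
/** Illustrative only: compares the shift form and the mask form of the sort-bit test. */
final class SortBitMask {
    static final int TYPE_INFO_MASK = 0x3F; // typeInfo is 6 bits wide

    static boolean sortedViaShift(int typeInfo) {
        return ((typeInfo >> 5) & 0x1) != 0;
    }

    static boolean sortedViaMask(int typeInfo) {
        return (typeInfo & 0x20) != 0;
    }

    /** True if both forms agree for every possible 6-bit typeInfo value. */
    static boolean formsAgree() {
        for (int t = 0; t <= TYPE_INFO_MASK; t++) {
            if (sortedViaShift(t) != sortedViaMask(t)) {
                return false;
            }
        }
        return true;
    }

    /** True if any 6-bit value has bit 16 (0x10000) set; it never does. */
    static boolean wideMaskEverMatches() {
        for (int t = 0; t <= TYPE_INFO_MASK; t++) {
            if ((t & 0x10000) != 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(formsAgree());          // true
        System.out.println(wideMaskEverMatches()); // false
    }
}
```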

yadavay-amzn · 2026-05-18T17:07:32Z

@steveloughran Good analysis. The sort-then-binary-search approach is better when k lookups are expected on the same object. However, the Variant spec defines the sort bit specifically to avoid paying the sort cost on read: producers that sort at write time set the bit, and readers can then binary search without re-sorting. For unsorted objects (sort bit = 0), linear scan is the safe fallback per spec.

A future optimization could sort-on-first-access and cache, but that changes the object's memory model (currently zero-copy over the binary buffer). Keeping it simple for now.

Will address the bitmask comment.

…ct fields

getFieldByKey uses binary search for objects with >=32 fields, assuming field IDs are sorted alphabetically. The Variant format spec allows unsorted objects (indicated by a header bit). External producers (Parquet, Iceberg) may produce unsorted variants, causing binary search to silently return null for existing keys.

Fix: check the object header sort bit before choosing binary search vs linear scan. Fall back to linear scan when fields are unsorted.

Closes #SPARK-56637

yadavay-amzn force-pushed the fix/SPARK-56637-variant-unsorted branch from afcb18f to 2768a21 Compare May 17, 2026 10:12

steveloughran reviewed May 18, 2026

yadavay-amzn force-pushed the fix/SPARK-56637-variant-unsorted branch from 2768a21 to 78e9183 Compare May 18, 2026 17:16

Retrigger CI (flaky OracleIntegrationSuite)

032ed35

yadavay-amzn wants to merge 2 commits into apache:master from yadavay-amzn:fix/SPARK-56637-variant-unsorted
