[SPARK-56637][CORE] Fix Variant getFieldByKey to handle unsorted object fields #55928

Open
yadavay-amzn wants to merge 2 commits into apache:master from yadavay-amzn:fix/SPARK-56637-variant-unsorted

@yadavay-amzn
Contributor

What changes were proposed in this pull request?

getFieldByKey() uses binary search for objects with >=32 fields, assuming field IDs are sorted alphabetically by key name. The Variant format spec allows unsorted objects (indicated by bit 4 of the object header). External producers (Parquet, Iceberg) may produce unsorted variants, causing binary search to silently return null for keys that exist.

Fix: check the object header sort bit before choosing binary search vs linear scan. Fall back to linear scan when fields are unsorted.
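The dispatch described above can be sketched as follows. This is a minimal illustration, not the code in the PR: the class name, the `findKey` signature, and the use of plain string keys in place of dictionary-resolved field IDs are all assumptions; only the threshold of 32 comes from the diff quoted later in this conversation.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a Variant object: just its key list and a "sorted" flag. */
final class SortAwareLookupSketch {
    // Same value as the threshold quoted in the PR diff; everything else here is illustrative.
    static final int BINARY_SEARCH_THRESHOLD = 32;

    /** Returns the index of key in keys, or -1 if the key is absent. */
    static int findKey(String[] keys, boolean sorted, String key) {
        if (keys.length < BINARY_SEARCH_THRESHOLD || !sorted) {
            // Linear scan: correct for any field ordering.
            for (int i = 0; i < keys.length; i++) {
                if (keys[i].equals(key)) {
                    return i;
                }
            }
            return -1;
        }
        // Binary search: only valid when keys are in ascending order.
        int idx = Arrays.binarySearch(keys, key);
        return idx >= 0 ? idx : -1;
    }
}
```

The point of the design is that the sorted flag gates only the fast path; the linear scan stays available as the fallback that is correct regardless of ordering.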

Why are the changes needed?

Data correctness bug -- getFieldByKey silently returns null for fields that exist in unsorted variant objects. This affects any variant data produced by external systems that do not sort field IDs.

Does this PR introduce any user-facing change?

Yes -- queries on variant columns with unsorted objects will now correctly return field values instead of null.

How was this patch tested?

Added a test in VariantExpressionSuite that constructs a 32-field unsorted variant object (sort bit=0, field IDs in reverse order) and verifies that getFieldByKey finds keys correctly. The test fails without the fix (binary search returns null) and passes with it.
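The failure mode the test targets can be reproduced outside Spark with a small standalone sketch. It is not the test from VariantExpressionSuite: the class and method names are hypothetical, and plain strings stand in for the real Variant binary layout. It only shows that a textbook binary search over 32 keys in reverse order misses keys that a linear scan finds.

```java
/** Illustrative only: string keys stand in for field IDs resolved through the dictionary. */
final class UnsortedBinarySearchDemo {
    /** Builds n keys "k00", "k01", ... stored in descending order, i.e. an unsorted object. */
    static String[] reversedKeys(int n) {
        String[] keys = new String[n];
        for (int i = 0; i < n; i++) {
            keys[i] = String.format("k%02d", n - 1 - i);
        }
        return keys;
    }

    /** Textbook binary search that assumes ascending order; returns -1 when not found. */
    static int binarySearch(String[] keys, String key) {
        int low = 0;
        int high = keys.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            int cmp = keys[mid].compareTo(key);
            if (cmp < 0) {
                low = mid + 1;
            } else if (cmp > 0) {
                high = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }

    /** Order-independent lookup; returns -1 when not found. */
    static int linearScan(String[] keys, String key) {
        for (int i = 0; i < keys.length; i++) {
            if (keys[i].equals(key)) {
                return i;
            }
        }
        return -1;
    }
}
```

With `reversedKeys(32)`, `binarySearch(keys, "k00")` returns -1 even though the key is stored at index 31, which `linearScan` reports correctly.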

Was this patch authored or co-authored using generative AI tooling?

Yes.

yadavay-amzn force-pushed the fix/SPARK-56637-variant-unsorted branch from afcb18f to 2768a21 on May 17, 2026 10:12
@steveloughran
Contributor

steveloughran commented May 18, 2026

The Iceberg implementation sorts all input, irrespective of size, and then binary searches.

I'm wondering if that's a better approach.

Here every lookup on a large variant will, on average, take n/2 comparisons. For k lookups, the total cost of the operation is k * n / 2.

Java's TimSort averages O(n * lg(n)), after which each binary search takes lg(n) operations. For k lookups, the total cost is k * lg(n) + n * lg(n).

For sorting plus binary search to be cheaper, we need

k * n / 2 > (k + n) * lg (n)

Unless I've got my numbers wrong, you're going to need a large value of k before it's worth sorting.

Special case: all values get looked up (k = n). The threshold is

n * n / 2 > n * 2 * lg(n)
n / 2 = 2 * lg(n)
n = 4 * lg(n)
n / lg(n) = 4

which can be solved for n = 16.

So we could say "sort if length > 16", but again, that's assuming every value is resolved, which we don't know in advance.

If only a few values are looked up, a plain linear scan for each of them, as here, is the right thing to do.

To conclude: I think this simple check is the right strategy, unless and until more data is collected on what variant structures actually get used in production systems.
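The cost model in this comment can be evaluated numerically with a short sketch. It takes the two formulas exactly as written above (k * n / 2 for repeated linear scans, (k + n) * lg(n) for sort plus binary search) and ignores constant factors, cache effects, and the memory needed for a sorted copy. The class and method names are hypothetical.

```java
/** Back-of-the-envelope comparison of the two lookup strategies, using the model from the comment. */
final class LookupCostSketch {
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Modelled cost of k linear scans over n fields: k * n / 2. */
    static double linearScanCost(int n, int k) {
        return k * (n / 2.0);
    }

    /** Modelled cost of one sort plus k binary searches: (k + n) * lg(n). */
    static double sortThenSearchCost(int n, int k) {
        return (k + n) * log2(n);
    }

    /** Smallest k in [1, maxK] for which sorting first is strictly cheaper, or -1 if there is none. */
    static int breakEvenLookups(int n, int maxK) {
        for (int k = 1; k <= maxK; k++) {
            if (linearScanCost(n, k) > sortThenSearchCost(n, k)) {
                return k;
            }
        }
        return -1;
    }
}
```

Calling `breakEvenLookups(n, maxK)` for a few object sizes gives a rough sense of how many lookups per object the model requires before sorting pays off. Whether real workloads reach that count is the open question the comment raises.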

final int BINARY_SEARCH_THRESHOLD = 32;
if (size < BINARY_SEARCH_THRESHOLD) {
int typeInfo = (value[pos] >> BASIC_TYPE_BITS) & TYPE_INFO_MASK;
boolean sorted = ((typeInfo >> 5) & 0x1) != 0;
Contributor

You could do (typeInfo & 0x10000) != 0 and keep the ALU barrel shifter idle. Or even do the same on L52.

Contributor Author

@yadavay-amzn yadavay-amzn May 18, 2026

Applied the bitmask approach by using (typeInfo & 0x20) != 0 since typeInfo is 6 bits wide (masked by TYPE_INFO_MASK = 0x3F) and the sort bit is at position 5 within that. 0x20 = 1 << 5 = 32.

I think 0x10000 (65536) might be a typo since it exceeds the 6-bit range? Please let me know if I'm reading this wrong.
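The equivalence claimed here is easy to check exhaustively, since a 6-bit typeInfo has only 64 possible values. The sketch below is a standalone check, not code from the PR: the class and method names are hypothetical, and the constants are the ones quoted in this thread (TYPE_INFO_MASK = 0x3F, sort bit at position 5).

```java
/** Exhaustive check that the shift form and the mask form agree for every 6-bit typeInfo. */
final class SortBitMaskSketch {
    static final int TYPE_INFO_MASK = 0x3F; // value quoted in the reply above
    static final int SORT_BIT_MASK = 0x20;  // 1 << 5

    /** Original form from the diff: shift the bit down, then test it. */
    static boolean sortedViaShift(int typeInfo) {
        return ((typeInfo >> 5) & 0x1) != 0;
    }

    /** Mask form: test the bit in place, no shift needed. */
    static boolean sortedViaMask(int typeInfo) {
        return (typeInfo & SORT_BIT_MASK) != 0;
    }

    /** True if both forms give the same answer for all values 0..0x3F. */
    static boolean formsAgree() {
        for (int t = 0; t <= TYPE_INFO_MASK; t++) {
            if (sortedViaShift(t) != sortedViaMask(t)) {
                return false;
            }
        }
        return true;
    }

    /** True if any 6-bit value has a bit in common with 0x10000 (bit 16). */
    static boolean wideMaskEverSet() {
        for (int t = 0; t <= TYPE_INFO_MASK; t++) {
            if ((t & 0x10000) != 0) {
                return true;
            }
        }
        return false;
    }
}
```

`formsAgree()` returns true and `wideMaskEverSet()` returns false, which is consistent with the reading that 0x10000 cannot match any value already masked to 6 bits.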

@yadavay-amzn
Contributor Author

yadavay-amzn commented May 18, 2026

@steveloughran Good analysis. The sort-then-binary-search approach is better when k lookups are expected on the same object. However, the Variant spec defines the sort bit specifically to avoid paying the sort cost on read: producers that sort at write time set the bit, and readers can binary search without re-sorting. For unsorted objects (sort bit = 0), linear scan is the safe fallback per spec.

A future optimization could sort-on-first-access and cache, but that changes the object's memory model (currently zero-copy over the binary buffer). Keeping it simple for now.

Will address the bitmask comment.

…ct fields

getFieldByKey uses binary search for objects with >=32 fields,
assuming field IDs are sorted alphabetically. The Variant format
spec allows unsorted objects (indicated by a header bit). External
producers (Parquet, Iceberg) may produce unsorted variants, causing
binary search to silently return null for existing keys.

Fix: check the object header sort bit before choosing binary search
vs linear scan. Fall back to linear scan when fields are unsorted.

Closes #SPARK-56637
yadavay-amzn force-pushed the fix/SPARK-56637-variant-unsorted branch from 2768a21 to 78e9183 on May 18, 2026 17:16