[SPARK-56637][CORE] Fix Variant getFieldByKey to handle unsorted object fields#55928
[SPARK-56637][CORE] Fix Variant getFieldByKey to handle unsorted object fields#55928yadavay-amzn wants to merge 2 commits into
Conversation
afcb18f to
2768a21
Compare
|
The iceberg impl. sorts all input, irrespective of size, and then binary searches. I'm wondering if that's a better approach. Here every lookup on a large variant will, on average, take Java's timsort sort averages For sorting + binary to be better then Unless I've got my numbers wrong, you're going to need large value of k before it's worth sorting. special case, all values get looked up. The threshold is which can be solved for n = 16. So we could say "sort if length > 16", but again, that's assuming every value is resolved, which we don't know in advance. If only a few values are looked up, just linear scan them, as here, is the right thing to do. To conclude: I think this simple check is the right strategy, unless and until more data is collected on what variant structures actually get used in production systems. |
| final int BINARY_SEARCH_THRESHOLD = 32; | ||
| if (size < BINARY_SEARCH_THRESHOLD) { | ||
| int typeInfo = (value[pos] >> BASIC_TYPE_BITS) & TYPE_INFO_MASK; | ||
| boolean sorted = ((typeInfo >> 5) & 0x1) != 0; |
There was a problem hiding this comment.
you could do (typeInfo & 0x10000) != 0 and keep the ALU barrel shifter idle. Or even do the same on L52
There was a problem hiding this comment.
Applied the bitmask approach by using (typeInfo & 0x20) != 0 since typeInfo is 6 bits wide (masked by TYPE_INFO_MASK = 0x3F) and the sort bit is at position 5 within that. 0x20 = 1 << 5 = 32.
I think 0x10000 (65536) might be a typo since it exceeds the 6-bit range? Please let me know if I'm reading this wrong.
|
@steveloughran Good analysis. The sort-then-binary-search approach is better when A future optimization could sort-on-first-access and cache, but that changes the object's memory model (currently zero-copy over the binary buffer). Keeping it simple for now. Will address the bitmask comment. |
…ct fields getFieldByKey uses binary search for objects with >=32 fields, assuming field IDs are sorted alphabetically. The Variant format spec allows unsorted objects (indicated by a header bit). External producers (Parquet, Iceberg) may produce unsorted variants, causing binary search to silently return null for existing keys. Fix: check the object header sort bit before choosing binary search vs linear scan. Fall back to linear scan when fields are unsorted. Closes #SPARK-56637
2768a21 to
78e9183
Compare
What changes were proposed in this pull request?
getFieldByKey()uses binary search for objects with >=32 fields, assuming field IDs are sorted alphabetically by key name. The Variant format spec allows unsorted objects (indicated by bit 4 of the object header). External producers (Parquet, Iceberg) may produce unsorted variants, causing binary search to silently return null for keys that exist.Fix: check the object header sort bit before choosing binary search vs linear scan. Fall back to linear scan when fields are unsorted.
Why are the changes needed?
Data correctness bug --
getFieldByKeysilently returns null for fields that exist in unsorted variant objects. This affects any variant data produced by external systems that do not sort field IDs.Does this PR introduce any user-facing change?
Yes -- queries on variant columns with unsorted objects will now correctly return field values instead of null.
How was this patch tested?
Added test in
VariantExpressionSuitethat constructs a 32-field unsorted variant object (sort bit=0, field IDs in reverse order) and verifiesgetFieldByKeyfinds keys correctly. Test fails without the fix (binary search returns null), passes with it.Was this patch authored or co-authored using generative AI tooling?
Yes.