HIVE-29598: Fix vectorized outer join wrong results due to stale scratch column values #6486
…tch column values
soumyakanti3578
left a comment
Vectorization code can be tricky and brittle. To ensure there are no unintended consequences of this change, could you please add some tests? A minimal reproducer as a qtest is essential, and unit tests for the vector clearing operation across different ColumnVector types would also be valuable.
@ryukobayashi could you please take a look at my suggested changes to your approach in ryukobayashi#1? That PR reproduces the actual problem somewhat better and incorporates several suggestions from @soumyakanti3578.
HIVE-29598 follow-up: virtualize clearValue + replace qtest
@konstantinb Thanks, I accept your proposal because it makes sense.
@ryukobayashi one more tweak: ryukobayashi#2, which addresses SQ feedback and trims the excessive comments and Hive config flags used by the test. The main module still seems to have at least three unfixed code paths, but those appear to be impossible to hit with the current CBO rewrites. I am unsure whether we should leave them as they are or fix them preemptively.
HIVE-29598: SQ feedback + comments' cleanup
@konstantinb Merged. If the corrections are minor, I think it's okay to make them just to be safe.
…VectorMapJoinOuterGenerateResultOperator.java Co-authored-by: konstantinb <konstantinb@users.noreply.github.com>
…VectorMapJoinOuterGenerateResultOperator.java Co-authored-by: konstantinb <konstantinb@users.noreply.github.com>
…rameterized tests; add generateOuterNullsRepeatedAll parameterized test; verify neighbor slot unchanged in TestBytesColumnVector



What changes were proposed in this pull request?
In vectorized outer join, generateOuterNulls() and generateOuterNullsRepeatedAll() set isNull[i] = true on scratch columns but leave vector[i] untouched. When hive.vectorized.reuse.scratch.columns=true (the default), a scratch column slot freed after an expression evaluation (e.g. CastStringToLong) can be reused for the outer join's null-marking column. After reset() clears isNull[], the expression overwrites vector[i] with a fresh value (e.g. 2025). Later, generateOuterNulls() sets isNull[i] = true without clearing vector[i], leaving a stale non-zero value.
Downstream operators such as ColOrCol read vector[i] directly to distinguish "false" (== 0) from "null" (!= 0). The stale value causes null rows to be misinterpreted as "true", producing wrong OR/AND/CASE WHEN results.
The fix adds clearVectorValue(), called whenever isNull[i] is set to true in the outer join null-marking paths, zeroing vector[i] for all supported column vector types (LongColumnVector, DoubleColumnVector, BytesColumnVector, TimestampColumnVector, IntervalDayTimeColumnVector).
Why are the changes needed?
Without the fix, vectorized outer joins silently return wrong results when scratch column reuse is enabled (the default). The bug is non-obvious because it only triggers when a specific combination of conditions is met: a type-casting expression allocates a scratch column that is later reused for the outer join's null-marking column, and the join result is consumed by a boolean operator that reads the raw vector value for null discrimination. Users have no indication that results are wrong; workarounds require disabling vectorization entirely (hive.vectorized.execution.enabled=false) or disabling scratch column reuse (hive.vectorized.reuse.scratch.columns=false), both of which carry a significant performance cost.
Does this PR introduce any user-facing change?
No
How was this patch tested?
I added a qtest.
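For readers unfamiliar with the failure mode, the sequence from the description (reset, expression write, null-marking, raw read) can be replayed in a few lines. This is a minimal sketch and not Hive code: LongColumn, markNull, or, and replay are hypothetical stand-ins invented for illustration, and the or method is a deliberately simplified consumer that trusts the raw slot value, which is only assumed to resemble what the description says about ColOrCol.

```java
import java.util.Arrays;

public class StaleScratchColumnSketch {

    /** Hypothetical stand-in for a long-typed column vector: values plus a null mask. */
    static final class LongColumn {
        final long[] vector;
        final boolean[] isNull;

        LongColumn(int size) {
            vector = new long[size];
            isNull = new boolean[size];
        }

        /** Mimics a reset that clears the null mask only; old values stay in place. */
        void reset() {
            Arrays.fill(isNull, false);
        }
    }

    /** Marks row i as NULL; when clearValue is set, also zeroes the value slot. */
    static void markNull(LongColumn col, int i, boolean clearValue) {
        col.isNull[i] = true;
        if (clearValue) {
            col.vector[i] = 0L;
        }
    }

    /**
     * Simplified three-valued OR for one row (an assumption, not Hive's logic).
     * It reads the raw value even for NULL rows, so a stale non-zero slot wins.
     * Returns Boolean.TRUE, Boolean.FALSE, or null for SQL NULL.
     */
    static Boolean or(LongColumn a, LongColumn b, int i) {
        if (a.vector[i] != 0 || b.vector[i] != 0) {
            return Boolean.TRUE;
        }
        if (a.isNull[i] || b.isNull[i]) {
            return null;
        }
        return Boolean.FALSE;
    }

    /** Replays the described sequence and returns OR(joinColumn, FALSE) for row 0. */
    static Boolean replay(boolean clearValue) {
        LongColumn scratch = new LongColumn(1);
        scratch.reset();                  // slot is handed out again, mask cleared
        scratch.vector[0] = 2025L;        // an expression writes a fresh value
        markNull(scratch, 0, clearValue); // outer join marks the unmatched row NULL

        LongColumn falseCol = new LongColumn(1); // other operand: non-null FALSE (0)
        return or(scratch, falseCol, 0);
    }

    public static void main(String[] args) {
        System.out.println("without clearing: " + replay(false));
        System.out.println("with clearing:    " + replay(true));
    }
}
```

In this sketch, replay(false) yields TRUE because the stale 2025 is still in the slot, while replay(true) yields null, which matches SQL semantics for NULL OR FALSE.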