Implement AnyRee#9959
Conversation
f398606 to
d626ef8
Compare
d626ef8 to
b7e7b73
Compare
c352224 to
f601fc0
Compare
bbe594c to
931003b
Compare
Jefffrey
left a comment
There was a problem hiding this comment.
Sorry this is dragging out so long, came back with a fresh set of eyes. Just wanna ensure we're getting this correct from the start 😅
| /// Returns the run ends of this array as a primitive array. | ||
| fn run_ends(&self) -> ArrayRef; | ||
|
|
||
| /// Returns the values of this array. | ||
| fn values(&self) -> &Arc<dyn Array>; | ||
|
|
||
| /// Returns a new run-end encoded array with the given values, preserving the | ||
| /// existing run ends. | ||
| fn with_values(&self, values: ArrayRef) -> ArrayRef; | ||
| } |
There was a problem hiding this comment.
Now that I think of it, how do these APIs handle when the original RunArray is sliced? I think they return unsliced original data, and especially for run_ends it erases the logical slicing that may have been applied?
There was a problem hiding this comment.
im a little confused by what you mean. Do you have an example of this case?
There was a problem hiding this comment.
So if theres a case like this:
#[test]
fn test123() {
let run = Int32Array::from(vec![3, 6, 9]);
let values = Int32Array::from(vec![1, 2, 3]);
// [1, 1, 1, 2, 2, 2, 3, 3, 3]
let array = RunArray::try_new(&run, &values).unwrap();
// [2, 2, 2]
// But still references same underlying arrays as above
// Essentially:
// {1, 1, 1, [2, 2, 2], 3, 3, 3}
let array_sliced = array.slice(3, 3);
let new_values = Int32Array::from(vec![7, 8, 9]);
// [8, 8, 8]
let new_array_sliced = array_sliced.with_values(Arc::new(new_values));
}When we slice the array we still hold on to the original values array; so array_sliced.values() would return [1, 2, 3] even though array_sliced only has value of 2 repeating. This isn't too big an issue, as I think at most its just a performance thing. For example:
let values = date_part(array.values(), part)?;
let new_array = array.with_values(values);If array is sliced, then we could be doing extra work since we're calculating date_part for the values which aren't in the slice.
But for run_ends() method its harder to consider since we don't have a use case for it currently; is it a potential footgun that calling run_ends() on a sliced run array ignores any slicing 🤔
There was a problem hiding this comment.
looking at the example you provided, shouldn't let array_sliced = array.slice(3, 3); ideally be represented as run_ends:[3],values:[2] instead of
{run_ends: [3, 6, 9], values: PrimitiveArray<Int32> [ 1, 2, 3, ]}
which is currently what array.slice() produces. due to REE.slice() copying the values arrays.
I think the best approach for now would be to remove the run_ends() function and add it in later if there is a need.
Im not sure how we would avoid the extra work for with_values() without updating the .slice() method. But this would require looking at all the CallSites for slice first to make sure this doesn't cause breaking changes. In theory this shouldn't causes regressions though it should be an implementation detail.
Which issue does this PR close?
closes #9909.
AnyRunArraytrait #9909.Rationale for this change
makes the API simpler to work with & less code duplication
What changes are included in this PR?
Replace the per-key-type RunEndEncoded match arms in length/bit_length (arrow-string) and date_part (arrow-arith) with a single dispatch through the new
AsArray::as_any_ree_opt/as_any_reereturning &dyn AnyRunEndArray, mirroring the existing dictionary handling. This removes thenow-unused
ree_map!macro, leaving one trait-object code path for all Int16/Int32/Int64 run-end types.Are these changes tested?
yes
Are there any user-facing changes?
no