HIVE-29536: Stabilize rebalance compaction tests#6487
HIVE-29536: Stabilize rebalance compaction tests#6487InvisibleProgrammer wants to merge 1 commit into
Conversation
|
thomasrebele
left a comment
There was a problem hiding this comment.
Thank you for working on the fix! I've added some suggestions and requests for improving it.
| */ | ||
|
|
||
| int optimalRecordsInBucket = expectedData.size() / bucketCount; | ||
| int maximumRecordCountInABucket = optimalRecordsInBucket + bucketCount - 1; |
There was a problem hiding this comment.
I think a different formula is needed here. For example, if we have 11 rows and 10 buckets, the optimal records in bucket would be 11/10 = 1. Then maximumRecordCountInABucket = 1 + 10 - 1 = 10. If one bucket contains 10 rows, then another bucket contains 1 row, and the remaining 8 buckets are empty. I would not call that distribution "balanced".
| .reduce(0, Integer::sum); | ||
|
|
||
| int optimalRecordsInBucket = allRecordCount / bucketCount; | ||
| int maximumRecordCountInABucket = optimalRecordsInBucket + bucketCount - 1; |
There was a problem hiding this comment.
| TestDataProvider testDataProvider = prepareRebalanceTestData(tableName); | ||
|
|
||
| //Try to do a rebalancing compaction | ||
| executeStatementOnDriver("ALTER TABLE " + tableName + " COMPACT 'rebalance'", driver); |
There was a problem hiding this comment.
Maybe add a comment that without explicit ORDER BY, there's an implicit order defined by the rebalance compaction query that fixes the order?
There was a problem hiding this comment.
It is a test case. I wonder how it helps understanding rebalance compaction or the test case itself.
It is an information that I would really like as part of a full description of the feature, in the class that responsible for doing the feature and in the official documentation.
What do you think?
|
|
||
| /* | ||
| check if the test data is unbalanced | ||
| balanced if all the buckets contains between n / bucket count and n / bucket count + bucket count rows |
There was a problem hiding this comment.
See comment https://github.com/apache/hive/pull/6487/changes#r3248007538.
Nit: contain (without s)
|
|
||
| // Assert that we have multiple buckets | ||
| List<String> bucketFilenames = CompactorTestUtil.getBucketFileNames(fs, table, null, "base_0000001"); | ||
| assertTrue(bucketFilenames.size() > 1); |
There was a problem hiding this comment.
This assert should be removed (we're checking for == 1 later).
| Assert.assertFalse(isBalanced(tableName, testDataProvider)); | ||
|
|
||
| // Please note, as the test tests rebalance compaction, not insert overwrite, it is not necessary to test if | ||
| // we have the exact same data after preparing the test data as we had at the source table. |
There was a problem hiding this comment.
I think it would still be helpful to check the data here, in order to catch problems from other parts of Hive early. Maybe simply check the number of records in the table.
| Assert.assertEquals(errorMessage, e.getCauseMessage()); | ||
| Assert.assertEquals(ErrorMsg.COMPACTION_REFUSED.getErrorCode(), e.getErrorCode()); | ||
| assertEquals(errorMessage, e.getCauseMessage()); | ||
| assertEquals(ErrorMsg.COMPACTION_REFUSED.getErrorCode(), e.getErrorCode()); |
There was a problem hiding this comment.
I would prefer to do refactors in a separate ticket/commit. The official guideline says "reformat code unrelated to the issue being fixed: formatting changes should be separate patches/commits;". While this is not strictly a formatting change, it feels quite close to it.
There was a problem hiding this comment.
Errr... Sure. Let me check this. I suppose I have to reduce my IDE's capabilities and it can be done. Honestly, I just removed some code blocks from this class. Those changes were handled by the IDE and I didn't even notice them.
However, I think the official guideline is a little bit obsolete: I'm pretty sure I saw a discussion somewhere about the opposite, like please do not do trivial changes like small reformats, formatting, etc to reduce the load on the precommit jobs.
There was a problem hiding this comment.
Hm, reducing the workload on the CI is indeed a valid concern. Is it possible to merge this PR with a rebase, instead of a squash? That way the import related changes could be move to a separate commit in the same PR. I wished Github had an interactive rebase feature for merging PRs.
I had a quick search, but couldn't find the discussion. Do you have a link by chance?
| assertTrue(expectedData.contains(rowData)); | ||
| expectedData.remove(rowData); |
There was a problem hiding this comment.
To avoid looking up the element twice, the return value of Set#remove() could be used.



Rebalance tests are sensitive and the hard-coded assertions need to be modified regularly.
Some examples:
...
There are two causes identified:
What changes were proposed in this pull request?
The goal of the change is to stabilize those tests by doing two things:
Please note: I also refactored the code little bit and extracted rebalance compaction tests into a new class.
Why are the changes needed?
We experienced regular and serious regression issues due to the effect of the orc version number.
Does this PR introduce any user-facing change?
No
How was this patch tested?
With the existing tests.