Skip to content

[BugFix] Include underlying cause in HdfsFsManager copy error messages#73414

Open
xiangguangyxg wants to merge 3 commits into
StarRocks:mainfrom
xiangguangyxg:fix-cluster-snapshot-upload-error-msg
Open

[BugFix] Include underlying cause in HdfsFsManager copy error messages#73414
xiangguangyxg wants to merge 3 commits into
StarRocks:mainfrom
xiangguangyxg:fix-cluster-snapshot-upload-error-msg

Conversation

@xiangguangyxg
Copy link
Copy Markdown
Contributor

@xiangguangyxg xiangguangyxg commented May 18, 2026

Why I'm doing:

HdfsFsManager.copyFromLocal / copyToLocal wrap the underlying Hadoop exception in a StarRocksException but only put the source / destination paths into the new message, dropping the cause's message.

Callers that surface only StarRocksException#getMessage to the user can lose all useful diagnostic information. The motivating case is automated cluster snapshot upload to remote object storage:

ClusterSnapshotCheckpointScheduler.java
    errMsg = "upload image failed, err msg: " + e.getMessage();

When the underlying S3/OSS upload fails (AccessDenied, NoSuchBucket, SocketTimeoutException, SignatureDoesNotMatch, FileAlreadyExistsException, etc.), the user only sees the following in information_schema.cluster_snapshot_jobs.error_message:

upload image failed, err msg: Failed to copy local /opt/starrocks/fe/meta/image to s3://tenant-bucket/.../meta/image/automated_cluster_snapshot_xxx

…with no hint of the real reason. Operators currently have to grep fe.log for the Exception while copy local … stack trace to find what actually went wrong.

What I'm doing:

  • Append e.getMessage() to the StarRocksException wrapper message in both copyFromLocal and copyToLocal (for both InterruptedIOException and the generic Exception paths), so the real cause is visible at the call site without losing the existing wrapper for log greppability.
  • Drive-by: fix the missing space before to local in copyToLocal's error string (previously produced Failed to copy /pathto local /dest).
  • Add unit tests testCopyToLocalIncludesCauseMessage and testCopyFromLocalIncludesCauseMessage covering both methods.

The underlying cause is already attached to the StarRocksException and still gets logged with full stack trace via LOG.error — this change only adds the cause's message to the user-visible message string.

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
    • This pr needs auto generate documentation
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 4.1
    • 4.0
    • 3.5
    • 3.4

copyFromLocal/copyToLocal wrap the underlying Hadoop exception in a
StarRocksException but only put the raw paths into the new message,
dropping the cause's message. Callers that surface only
StarRocksException#getMessage (e.g. ClusterSnapshotJob.error_message
shown via information_schema.cluster_snapshot_jobs) therefore see
"Failed to copy local /opt/starrocks/fe/meta/image to s3://..."
without any hint of the real reason (AccessDenied, NoSuchBucket,
SocketTimeoutException, etc.), forcing operators to dig into fe.log
to find the cause.

Append the cause's message to the wrapper message so the real reason
is visible at the call site too. Also fix the missing space before
"to local" in copyToLocal's error string.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@CelerData-Reviewer
Copy link
Copy Markdown

@codex review

@github-actions github-actions Bot requested a review from srlch May 18, 2026 03:32
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Already looking forward to the next diff.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Cover both copyToLocal and copyFromLocal: the wrapper
StarRocksException's message must include the underlying cause's
message so callers that only surface getMessage() still see the
real failure reason.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fe Cover Checker flagged lines 1222 and 1238 (the InterruptedIOException
catch blocks of copyToLocal/copyFromLocal) as uncovered. Add tests that
inject InterruptedIOException through Mockito so the wrapper message
formatting in those branches is exercised.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@CelerData-Reviewer
Copy link
Copy Markdown

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Can't wait for the next one!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@github-actions
Copy link
Copy Markdown
Contributor

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions
Copy link
Copy Markdown
Contributor

[FE Incremental Coverage Report]

pass : 4 / 4 (100.00%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 com/starrocks/fs/hdfs/HdfsFsManager.java 4 4 100.00% []

@github-actions
Copy link
Copy Markdown
Contributor

[BE Incremental Coverage Report]

pass : 0 / 0 (0%)

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants