Multithreaded replication, parallel row-copy with DML merge, frontier filter, and heartbeat lag throttle#2

Open
dnovitski wants to merge 3 commits into master from perf/parallel-rowcopy

@dnovitski dnovitski commented Apr 29, 2026

Performance Optimizations for gh-ost

Note: This PR incorporates and supersedes the changes from #1 (multithreaded replication data inconsistency fix).

This PR adds several performance optimizations to gh-ost that significantly speed up row-copy under high write load while keeping binlog lag bounded.

Features

1. Parallel Row-Copy (inspired by "feat concurrent chunk data", #1398)

  • --copy-concurrency=N — parallel row-copy workers (default 1)
  • Bounded drain budget gives row-copy more execution turns instead of blocking indefinitely on DML drain
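
The bounded drain can be pictured roughly as follows. This is a minimal sketch, not the PR's code: `drainWithBudget`, the `int` event type, and the millisecond parameter are hypothetical names chosen for illustration, and the 50 ms value in `main` is taken from the HeartbeatLag analysis later in this description.

```go
package main

import (
	"fmt"
	"time"
)

// drainWithBudget applies queued DML events until the queue is empty or the
// time budget is spent, whichever comes first. It returns the number applied.
// All names here are hypothetical and only illustrate the idea.
func drainWithBudget(events <-chan int, budgetMillis int64, apply func(int)) int {
	deadline := time.Now().Add(time.Duration(budgetMillis) * time.Millisecond)
	applied := 0
	for time.Now().Before(deadline) {
		select {
		case ev := <-events:
			apply(ev)
			applied++
		default:
			// Queue is empty: hand the turn back to row-copy right away.
			return applied
		}
	}
	// Budget spent: row-copy gets a turn even though events may remain.
	return applied
}

func main() {
	events := make(chan int, 8)
	for i := 0; i < 5; i++ {
		events <- i
	}
	n := drainWithBudget(events, 50, func(int) {})
	fmt.Println("applied:", n)
}
```

The point of the budget is that the drain loop always returns after a bounded time, so row-copy is never blocked behind a DML queue that keeps refilling.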

2. DML Event Merging (inspired by "feat binlog apply optimization", #1378)

  • Merges redundant DML events for the same row before applying
  • Under high write load, reduces applied statements by ~36%
  • Example: INSERT + UPDATE + UPDATE → single INSERT with final values
  • Disable with --skip-dml-merge
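
As an illustration of the merge rule in the example above, here is a minimal sketch under stated assumptions: every event carries the full row image, no event changes the row's key, and rows are independent of each other. The `Event` type and `mergeEvents` are hypothetical and are not the PR's implementation. The sketch uses a simple "last event wins per key" rule, with one exception: an UPDATE that follows an INSERT stays an INSERT carrying the final values. A trailing DELETE is kept as a DELETE rather than cancelled against an earlier INSERT, which is a simplification chosen for this sketch.

```go
package main

import "fmt"

type Op int

const (
	Insert Op = iota
	Update
	Delete
)

// Event is a hypothetical, simplified DML event: Value stands in for the
// full row image after the change.
type Event struct {
	Op    Op
	Key   string
	Value string
}

// mergeEvents keeps one event per key, in first-seen key order.
func mergeEvents(batch []Event) []Event {
	latest := make(map[string]Event, len(batch))
	order := make([]string, 0, len(batch))
	for _, ev := range batch { // ev is a copy, so the input batch is never mutated
		prev, seen := latest[ev.Key]
		if !seen {
			order = append(order, ev.Key)
		} else if prev.Op == Insert && ev.Op == Update {
			// INSERT + UPDATE folds into a single INSERT with the final values.
			ev.Op = Insert
		}
		latest[ev.Key] = ev
	}
	out := make([]Event, 0, len(order))
	for _, k := range order {
		out = append(out, latest[k])
	}
	return out
}

func main() {
	batch := []Event{
		{Op: Insert, Key: "id=1", Value: "a"},
		{Op: Update, Key: "id=1", Value: "b"},
		{Op: Update, Key: "id=1", Value: "c"},
		{Op: Update, Key: "id=2", Value: "x"},
	}
	for _, ev := range mergeEvents(batch) {
		fmt.Printf("op=%d key=%s value=%s\n", ev.Op, ev.Key, ev.Value)
	}
}
```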

3. Frontier Filter (inspired by "feat binlog apply optimization", #1378)

  • Skips DML events for rows not yet copied (row-copy will capture latest value)
  • Reduces redundant work during the copy phase
  • Only active when --copy-concurrency=1 (single-copy). With parallel copy, multiple chunks are in flight at once and may not have committed yet, so the frontier position is not a reliable boundary and skipping events beyond it is unsafe
  • Automatically disabled in replica modes (TestOnReplica/MigrateOnReplica). There, binlog events are read from the replica's relay log and may be ahead of the SQL thread's apply position, so row-copy SELECT queries may not yet see the data from skipped events, which would cause silent data loss
  • Disable with --skip-dml-frontier-filter
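
The conditions above can be summarized in a small decision function. This is a sketch only: `FilterConfig`, `Enabled`, and `shouldApply` are hypothetical names, and it assumes a single-column integer key where `frontier` is the highest key that row-copy has already committed.

```go
package main

import "fmt"

// FilterConfig is a hypothetical bundle of the settings that decide whether
// the frontier filter may be used at all.
type FilterConfig struct {
	CopyConcurrency int  // value of --copy-concurrency
	ReplicaMode     bool // TestOnReplica or MigrateOnReplica
	SkipFilter      bool // --skip-dml-frontier-filter
}

// Enabled mirrors the three conditions listed above.
func (c FilterConfig) Enabled() bool {
	return !c.SkipFilter && !c.ReplicaMode && c.CopyConcurrency == 1
}

// shouldApply reports whether a DML event for eventKey must be applied.
// Rows beyond the frontier have not been copied yet, so row-copy will pick
// up their latest value and the event can be skipped.
func shouldApply(c FilterConfig, eventKey, frontier int64) bool {
	if !c.Enabled() {
		return true // filter off: apply everything
	}
	return eventKey <= frontier
}

func main() {
	single := FilterConfig{CopyConcurrency: 1}
	parallel := FilterConfig{CopyConcurrency: 4}
	fmt.Println(shouldApply(single, 500, 1000))    // already copied
	fmt.Println(shouldApply(single, 1500, 1000))   // not yet copied
	fmt.Println(shouldApply(parallel, 1500, 1000)) // filter inactive
}
```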

4. Heartbeat Lag Throttle

  • --copy-max-lag-millis (default 60000) prevents unbounded binlog lag growth during parallel row-copy
  • When HeartbeatLag exceeds the threshold, row-copy pauses and only DML events are drained
  • Resumes at threshold/2 (hysteresis prevents oscillation)
  • Set to 0 to disable (maximum copy speed, unbounded lag)
  • See documentation for detailed comparison with --max-lag-millis

Runtime-Changeable Flags

  • copy-concurrency=<N> — change parallel copy workers at runtime (range 1-32)
  • copy-max-lag-millis=<N> — change heartbeat lag threshold at runtime (0 = disable)
  • See interactive commands documentation for usage
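
A sketch of how such `name=value` commands might be validated before being applied. `Settings` and `applyCommand` are hypothetical names, not gh-ost APIs; the 1-32 range and the "0 = disable" rule are taken from the list above, and rejecting negative lag values is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Settings is a hypothetical holder for the two runtime-changeable values.
type Settings struct {
	CopyConcurrency  int
	CopyMaxLagMillis int64
}

// applyCommand parses a "name=value" command and updates s only if the
// value is valid. On error, s is left unchanged.
func applyCommand(cmd string, s *Settings) error {
	name, arg, ok := strings.Cut(strings.TrimSpace(cmd), "=")
	if !ok {
		return fmt.Errorf("expected name=value, got %q", cmd)
	}
	n, err := strconv.ParseInt(strings.TrimSpace(arg), 10, 64)
	if err != nil {
		return fmt.Errorf("invalid value for %s: %w", name, err)
	}
	switch strings.TrimSpace(name) {
	case "copy-concurrency":
		if n < 1 || n > 32 {
			return fmt.Errorf("copy-concurrency must be in range 1-32, got %d", n)
		}
		s.CopyConcurrency = int(n)
	case "copy-max-lag-millis":
		if n < 0 { // assumption: negative values are rejected
			return fmt.Errorf("copy-max-lag-millis must be >= 0, got %d", n)
		}
		s.CopyMaxLagMillis = n // 0 disables the throttle
	default:
		return fmt.Errorf("unknown command %q", name)
	}
	return nil
}

func main() {
	s := Settings{CopyConcurrency: 1, CopyMaxLagMillis: 60000}
	for _, cmd := range []string{"copy-concurrency=8", "copy-concurrency=64", "copy-max-lag-millis=0"} {
		if err := applyCommand(cmd, &s); err != nil {
			fmt.Println("rejected:", err)
		}
	}
	fmt.Printf("%+v\n", s)
}
```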

Bug Fixes

  • Fixed buildDMLEventQuery DML mutation: UPDATE operations on unique-key tables no longer corrupt the shared DMLEvent object
  • Fixed frontier filter race in replica mode: Events read from binlog could be ahead of replica SQL thread position, causing missed changes
  • Fixed copy starvation with parallel row-copy: Unbuffered copyRowsQueue channel combined with HeartbeatLag sentinel value (before first heartbeat) caused copy to never get execution turns. Fixed with buffered channel and sentinel filtering

Benchmark Results (4-thread sysbench, 100K rows, 15-min runs)

| Configuration | Copy Time | Max HeartbeatLag | DML Events/sec | Result |
|---|---|---|---|---|
| All features (no throttle) | 23s | 207s ⚠️ | 983 | PASS |
| All features + lag throttle (60s) | 41s | ~55s | ~950 | PASS |
| No DML merge | 71s | 262s | 722 | PASS |
| No frontier filter | 28s | 200s | 970 | PASS |
| Single-copy baseline | 10m47s | 6.6s | 905 | PASS |

Key takeaways:

  • Parallel copy with throttle: 16x faster than single-copy baseline (41s vs 10m47s)
  • HeartbeatLag stays bounded at ~55s (vs 207s without throttle)
  • DML merge provides ~36% more events/sec throughput (983 vs 722 events/sec)
  • All configurations pass data consistency checks (row counts, NULL PKs, duplicate PKs, checksums)
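
For reference, the two headline ratios follow directly from the benchmark table. The helper names below are hypothetical; the inputs are the table's values (10m47s = 647s).

```go
package main

import "fmt"

// speedup returns how many times faster the new time is than the baseline.
func speedup(baselineSeconds, newSeconds float64) float64 {
	return baselineSeconds / newSeconds
}

// percentGain returns the relative increase of newRate over oldRate, in percent.
func percentGain(oldRate, newRate float64) float64 {
	return (newRate/oldRate - 1) * 100
}

func main() {
	// Single-copy baseline (10m47s = 647s) vs. all features + lag throttle (41s).
	fmt.Printf("copy speedup: %.1fx\n", speedup(647, 41))
	// DML events/sec without merge (722) vs. with all features (983).
	fmt.Printf("throughput gain: %.1f%%\n", percentGain(722, 983))
}
```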

HeartbeatLag Analysis

Without the throttle, binlog lag grows unboundedly because the bounded drain (50ms budget) gives row-copy more turns at the expense of DML processing. The lag throttle resolves this:

  • During copy phase: lag may briefly reach threshold (~55-60s), then copy pauses
  • During throttle pause: exclusive DML drain brings lag back to ~30s (threshold/2)
  • After copy completes: DML catch-up drains remaining lag to 0 within minutes
  • At cutover: lag is always near 0 (normal gh-ost cutover behavior)

New CLI Flags

| Flag | Default | Description |
|---|---|---|
| --copy-max-lag-millis | 60000 | Max heartbeat lag before throttling row-copy (0 = disabled) |
| --skip-dml-merge | false | Disable DML event merging (for benchmarking) |
| --skip-dml-frontier-filter | false | Disable frontier filter optimization (for benchmarking) |

Testing

  • All existing integration tests pass (MySQL 5.7, 8.0, 8.4, Percona 8.0)
  • New integration test for DML event merging (merge-dml-events)
  • New integration test for parallel row-copy with lag throttle (parallel-rowcopy-lag-throttle)
  • Unit tests for runtime-changeable flag commands (12 test cases)
  • 15-minute sysbench consistency tests under 4-thread concurrent write load
  • Data consistency validated: row counts, NULL PKs, duplicate PKs, checksums

@dnovitski dnovitski changed the title from "perf: parallel row-copy with dedicated connection pool and time-bounded drain" to "perf: parallel row-copy, DML event merging, and adaptive drain" Apr 29, 2026
@dnovitski dnovitski force-pushed the perf/parallel-rowcopy branch from 8b0acb3 to 9ab008d April 29, 2026 09:25
@dnovitski dnovitski changed the title from "perf: parallel row-copy, DML event merging, and adaptive drain" to "perf: Parallel row-copy with DML merge, frontier filter, and heartbeat lag throttle" Apr 29, 2026
jakubpliszka and others added 2 commits April 29, 2026 11:33
…binlog sentinel (github#1637)

* Prevent permanent worker deadlock when cutover times out waiting for binlog sentinel

Buffer allEventsUpToLockProcessed to MaxRetries() so the applier's send always
completes immediately even after waitForEventsUpToLock has timed out and exited.

---------

Co-authored-by: meiji163 <meiji163@github.com>
@dnovitski dnovitski force-pushed the perf/parallel-rowcopy branch 3 times, most recently from dd5dfd9 to 8a5b648 April 29, 2026 20:12
@dnovitski dnovitski closed this Apr 29, 2026
@dnovitski dnovitski reopened this Apr 29, 2026
@dnovitski dnovitski force-pushed the perf/parallel-rowcopy branch from 8a5b648 to 3110d30 April 29, 2026 20:22
@dnovitski dnovitski changed the base branch from mtr-squashed to master April 29, 2026 20:23
…ttle (#2)

Performance optimizations for gh-ost that significantly speed up row-copy
under high write load while keeping binlog lag bounded:

- Parallel row-copy with dedicated connection pool and time-bounded drain
- DML event merging within batches (INSERT/DELETE cancellation, UPDATE folding)
- Frontier filter to skip DML events beyond copy frontier
- Heartbeat lag throttle (--copy-max-lag-millis) for row-copy pacing
- Adaptive drain budget and auto-tuning chunk size
- Runtime-changeable --copy-concurrency and --copy-max-lag-millis
- Fix multithreaded replication data inconsistency

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@dnovitski dnovitski force-pushed the perf/parallel-rowcopy branch from 3110d30 to ca7577a April 29, 2026 20:35
@dnovitski dnovitski changed the title from "perf: Parallel row-copy with DML merge, frontier filter, and heartbeat lag throttle" to "Multithreaded replication, parallel row-copy with DML merge, frontier filter, and heartbeat lag throttle" Apr 29, 2026