sql: ignore stale on_connect firing after #onClose in pooled Postgres/MySQL#30950
sql: ignore stale on_connect firing after #onClose in pooled Postgres/MySQL#30950robobun wants to merge 4 commits into
Conversation
…/MySQL When a PooledPostgresConnection or PooledMySQLConnection's socket is closed in the same I/O tick as ReadyForQuery/handshake completion, #onClose runs synchronously before the queued on_connect microtask. The previous #onConnected blindly set state = connected and re-added the dead entry to readyConnections with this.connection === null. Subsequent queries dispatched null to Rust's PostgresSQLQuery.run and failed with 'connection must be a PostgresSQLConnection' forever — the pool never retried a conn it thought was live. Bail out of #onConnected when state !== pending. Same shape for MySQL. Fixes #30947
|
Updated 10:49 PM PT - May 17th, 2026
❌ @robobun, your commit 554c448 has 1 failures in
🧪 To try this PR locally: bunx bun-pr 30950That installs a local version of the PR into your bun-30950 --bun |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughAdds early-return guards in pooled connection onConnected handlers (MySQL and Postgres) to ignore stale callbacks when the connection state is no longer ChangesSQL pool connection state race condition
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
|
Found 2 issues this PR may fix:
🤖 Generated with Claude Code |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/js/sql/postgres-close-during-handshake.test.ts`:
- Around line 136-147: Read and assert stderr explicitly before checking payload
state and ensure the process exit code is asserted last: after you obtain
stdout, stderr, and exitCode, add expect(stderr).toBe(""); then parse the last
stdout line into parsed (as already done) and assert expect({ corrupted:
parsed.corrupted }).toEqual({ corrupted: false }); finally assert
expect(exitCode).toEqual(0). Use the existing variables (stdout, stderr,
exitCode, line, parsed) and keep the exit-code assertion as the final check.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: e1da5fe9-6dff-41eb-8f5e-2c3cd366a376
📒 Files selected for processing (1)
test/js/sql/postgres-close-during-handshake.test.ts
There was a problem hiding this comment.
Additional findings (outside current diff — PR may have been updated during review):
-
🟡
test/js/sql/postgres-close-during-handshake.test.ts:135-151— nit:stderris collected here but never asserted on or surfaced. If the fixture crashes before printing JSON, the test will fail with{corrupted: undefined, exitCode: <nonzero>}and the actual error will be silently discarded — consider foldingstderrinto thetoEqualobject (or logging it on failure) so CI failures are debuggable.Extended reasoning...
What the issue is
At lines 136-138 the test destructures
stderrfromPromise.all([proc.stdout.text(), proc.stderr.text(), proc.exited]), butstderris never referenced again. The final assertion at lines 148-151 only checks{ corrupted: parsed.corrupted, exitCode }, so whatever the subprocess wrote to stderr is read into a local and dropped.How it manifests
The fixture is expected to print exactly one JSON line on stdout and exit 0. If anything goes wrong before that
console.log(JSON.stringify(...))— a panic in the native PostgresSQLConnection, an uncaught exception in thenet.createServersetup, an assertion failure in a debug build, or a future refactor that introduces a syntax error — the subprocess will write its diagnostic to stderr and exit non-zero, and stdout will be empty.stdout.trim().split("\n").at(-1)then yields"",parsedbecomes{}, and the assertion fails with:expect({ corrupted: undefined, exitCode: 1 }).toEqual({ corrupted: false, exitCode: 0 })That diff tells you that the subprocess crashed, but not why. The actual error (stack trace, panic message, ASAN report) was sitting in
stderrand got thrown away.Why nothing else catches this
The test pipes both stdout and stderr (
stdout: "pipe", stderr: "pipe"), so the subprocess's stderr does not inherit to the test runner's terminal — it is only visible if the test code explicitly surfaces it. There is no other assertion orconsole.error(stderr)path. In CI the only artifact is the jest diff, which contains none of the diagnostic.Convention
Per
test/CLAUDE.md, spawned-process tests in this repo assert on stderr (e.g.expect(stderr).toBe("")) before the exit-code check, or fold it into the snapshot object, precisely so the failure diff carries the real error. The pattern here matches every other field of that convention except this one.Step-by-step proof
- Suppose a future change to
src/js/internal/sql/postgres.tsregresses such thatnew SQL({...})throws synchronously (or the debug build hits anassert). - The fixture process writes the uncaught-exception stack to stderr and exits with code 1, never reaching
console.log. stdoutis"", soline = "",parsed = {},parsed.corrupted = undefined.stderrcontains the full stack trace, but is never read after destructuring.- The test fails with
{ corrupted: undefined, exitCode: 1 } != { corrupted: false, exitCode: 0 }— no hint that the failure was (say)TypeError: Cannot read properties of undefinedrather than the pool-corruption regression this test guards against.
Suggested fix
Fold stderr into the assertion so it appears in the diff:
expect({ stderr, corrupted: parsed.corrupted, exitCode }).toEqual({ stderr: "", corrupted: false, exitCode: 0, });
If the fixture is expected to emit some stderr (e.g. connection-closed warnings), an alternative is to keep the existing assertion but include
stdout/stderrin the object so they show up in the failure diff without being matched against, or simplyconsole.error(stderr)before theexpect.This is purely a debuggability/test-quality concern — it does not affect whether the test passes or fails for the bug being fixed, so filing as a nit.
- Suppose a future change to
There was a problem hiding this comment.
The guard looks correct and the race analysis is convincing, but this is a state-machine change in the production SQL connection pool (Postgres + MySQL) with subtle microtask-vs-sync-callback ordering, so it's worth a maintainer's eye — also note Build #55608 is showing failures.
Extended reasoning...
Overview
This PR adds a single early-return guard at the top of #onConnected in both PooledPostgresConnection (src/js/internal/sql/postgres.ts) and PooledMySQLConnection (src/js/internal/sql/mysql.ts): if this.state !== PooledConnectionState.pending, the callback is dropped. It also adds a ~150-line regression test (test/js/sql/postgres-close-during-handshake.test.ts) that spins up a fake Postgres TCP server which sends the full trust-mode handshake plus an admin-shutdown error and immediately FINs, to deterministically reproduce the race where #onClose runs synchronously before the queued #onConnected microtask drains.
Security risks
None. The guard is purely defensive against an internal callback-ordering race; it doesn't touch auth, TLS, query escaping, or any user-input handling. The test fixture binds to 127.0.0.1:0 in a subprocess.
Level of scrutiny
Medium-high. The runtime change is only ~3 lines of logic per adapter, and tracing the state machine confirms the only legitimate transitions into #onConnected are from pending (constructor and #doRetry() both set state = pending before #startConnection()), so the guard cannot drop a valid callback. #onClose already handles error propagation, onFinish resolution, and adapter.release, so dropping a stale #onConnected after it doesn't leak or hang anything. That said, this is the connection-pool state machine for Bun.SQL — a production-critical path where pool corruption manifests as permanent wedging — and the fix hinges on subtle reasoning about native Rust's queue_microtask(on_connect) vs synchronous on_close dispatch ordering. A maintainer familiar with the native side (e.g. the PostgresSQLConnection lifecycle) should confirm there's no path where on_connect is the only signal and on_close never follows.
Other factors
- robobun flagged failures in Build #55608 on the latest commit; I can't tell from here whether they're related or pre-existing flakes.
- The new test hand-rolls Postgres wire-protocol packets and relies on uSockets delivering handshake+FIN in one poll dispatch; worth a quick check that it's not flaky across platforms.
- coderabbit's only comment (stderr assertion) was addressed in e4c0989 and resolved.
- No CODEOWNERS cover these files.
Fixes #30947.
Repro
The issue's script, against local Postgres:
30–40% failure rate on a
max: 10pool when all connections are closedserver-side while the event loop is blocked (the repro blocks via
Bun.spawnSyncrunningpg_terminate_backend; in production this isa pooler's idle reaper, a network blip, or Supabase's session pooler).
The pool never recovers;
sql.close()afterwards spins at 100% CPU.Cause
PooledPostgresConnection.#onConnectedis wired to the nativePostgresSQLConnection'son_connectcallback, which Rust schedulesas a microtask (
queue_microtask) fromset_status(Connected)afterthe server sends
ReadyForQuery.#onCloseis the paired callback,which
fail_with_js_valuefires synchronously.When the server's FIN lands in the same I/O tick as
ReadyForQuery(uSockets' poll returns
READABLE | EPOLLHUPin one wake, common onloopback or when a pooler closes many idle conns in a batch), uSockets
dispatches
us_dispatch_datathenus_dispatch_endin the sameus_internal_dispatch_ready_pollcall:on_dataparsesReadyForQuery→queue_microtask(on_connect).On
event_loop.exit()microtasks drain,#onConnectedwould run.ErrorResponsetoo, soon_datacallsfail— which runson_closesynchronously.#onClosenullsthis.connection,sets
state = closed, removes fromreadyConnections.on_connectcallbackfrom step 1. When it fires,
#onConnectedblindly overwritesstate = connectedand re-adds the entry toreadyConnections—but
this.connectionis stillnull.The next query sees this ghost in
readyConnections.size > 0,flushConcurrentQueriesdispatchesonQueryConnected(null, pooledConn),and
handle.run(pooledConn.connection /* null */, query)trips Rust'sfrom_js_refguard:#doRetryonly runs for conns withstate === closed, so a ghostin
connectedis never refreshed — the pool is permanently wedged.Fix
#onConnectedrefuses stale transitions:The only legitimate caller is the initial connect /
#doRetrypath,both of which set
state = pendingright before invoking#startConnection. Anything else is a racing microtask from aPostgresSQLConnection whose
on_closealready ran — those have to bedropped on the floor, not promoted back into
readyConnections.Same race exists in
PooledMySQLConnection(MySQL uses the samestate-machine shape and the same
on_connectmicrotask pattern) so theguard is applied to both adapters.
Verification
test/js/sql/postgres-close-during-handshake.test.ts— fake TCP serverthat serves the full trust-mode handshake (AuthOk + ParameterStatus
stack + BackendKeyData + ReadyForQuery + admin-shutdown ErrorResponse)
and FINs every socket. Runs under plain
bun bd test— no Docker.Without the fix, the fixture's first iteration throws
"connection must be a PostgresSQLConnection" and
corrupted: truegets printed; the test expects
corrupted: falseand fails. Withthe fix, the pool keeps reconnecting cleanly through 20 iterations
and the test passes in ~3.5s.
Verified:
bun,real Postgres)
(
bun bd testASAN debug)test/js/sql/*tests unaffected