sql: ignore stale on_connect firing after #onClose in pooled Postgres/MySQL by robobun · Pull Request #30950 · oven-sh/bun

robobun · 2026-05-17T23:18:23Z

Repro

The issue's script, against local Postgres:

```
$ DATABASE_URL=postgres://… bun bug.ts
[killed] {"n":"10"}
[q0] {"err":"connection must be a PostgresSQLConnection"}
[q1] {"err":"connection must be a PostgresSQLConnection"}
[q2] {"err":"connection must be a PostgresSQLConnection"}
[q3] {"err":"connection must be a PostgresSQLConnection"}
[q4] {"err":"connection must be a PostgresSQLConnection"}
```

30–40% failure rate on a max: 10 pool when all connections are closed
server-side while the event loop is blocked (the repro blocks via
Bun.spawnSync running pg_terminate_backend; in production this is
a pooler's idle reaper, a network blip, or Supabase's session pooler).
The pool never recovers; sql.close() afterwards spins at 100% CPU.

Cause

PooledPostgresConnection.#onConnected is wired to the native
PostgresSQLConnection's on_connect callback, which Rust schedules
as a microtask (queue_microtask) from set_status(Connected) after
the server sends ReadyForQuery. #onClose is the paired callback,
which fail_with_js_value fires synchronously.

When the server's FIN lands in the same I/O tick as ReadyForQuery
(uSockets' poll returns READABLE | EPOLLHUP in one wake, common on
loopback or when a pooler closes many idle conns in a batch), uSockets
dispatches us_dispatch_data then us_dispatch_end in the same
us_internal_dispatch_ready_poll call:

1. on_data parses ReadyForQuery → queue_microtask(on_connect). On event_loop.exit() microtasks drain, #onConnected would run.
2. But then the handshake response carried the admin-shutdown ErrorResponse too, so on_data calls fail, which runs on_close synchronously. #onClose nulls this.connection, sets state = closed, removes from readyConnections.
3. Pre-drain the microtask queue still holds the on_connect callback from step 1. When it fires, #onConnected blindly overwrites state = connected and re-adds the entry to readyConnections, but this.connection is still null.
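
The ordering in the steps above can be reproduced in isolation. The following is a minimal sketch, assuming a simplified stand-in for the pooled connection; `UnguardedPooledConnection`, `simulateRace`, and the string state names are illustrative and are not Bun's real types.

```typescript
// Illustrative model only: none of these names come from Bun's source.
type State = "pending" | "connected" | "closed";

class UnguardedPooledConnection {
  state: State = "pending";
  connection: object | null = {}; // stands in for the native connection handle
  readonly ready = new Set<UnguardedPooledConnection>();

  // Mirrors the pre-fix behaviour: promote to "connected" unconditionally.
  onConnected(): void {
    this.state = "connected";
    this.ready.add(this);
  }

  // Mirrors #onClose: drop the handle and leave the ready set.
  onClose(): void {
    this.connection = null;
    this.state = "closed";
    this.ready.delete(this);
  }
}

async function simulateRace(): Promise<UnguardedPooledConnection> {
  const conn = new UnguardedPooledConnection();
  // Step 1: ReadyForQuery was parsed, so on_connect is queued as a microtask.
  queueMicrotask(() => conn.onConnected());
  // Step 2: the ErrorResponse in the same read fails the connection synchronously.
  conn.onClose();
  // Step 3: yield once so the queued callback drains before we inspect the slot.
  await Promise.resolve();
  return conn;
}

simulateRace().then(conn => {
  // prints: connected null true
  console.log(conn.state, conn.connection, conn.ready.has(conn));
});
```

The slot ends up advertised as ready while holding no connection handle, which is the ghost entry described in the text.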

The next query sees this ghost in readyConnections.size > 0,
flushConcurrentQueries dispatches onQueryConnected(null, pooledConn),
and handle.run(pooledConn.connection /* null */, query) trips Rust's
from_js_ref guard:

```
// src/sql_jsc/postgres/PostgresSQLQuery.rs:536
let Some(connection) = postgres_sql_connection::js::from_js_ref(arguments[0]) else {
    return Err(global_object.throw(format_args!("connection must be a PostgresSQLConnection")));
};
```

#doRetry only runs for conns with state === closed, so a ghost
in connected is never refreshed — the pool is permanently wedged.
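
Why the ghost is never repaired can be shown with a small model. This is a sketch under stated assumptions: `Slot`, `slotsToRetry`, and `runOnSlot` are hypothetical names, and the thrown `TypeError` only stands in for the native guard quoted above.

```typescript
// Hypothetical model of a pool slot; not Bun's real data structures.
type SlotState = "pending" | "connected" | "closed";

interface Slot {
  state: SlotState;
  connection: { run(query: string): string } | null;
}

// Only slots marked closed are eligible for a reconnect, as described for #doRetry.
function slotsToRetry(slots: Slot[]): Slot[] {
  return slots.filter(slot => slot.state === "closed");
}

// Dispatch trusts the state flag and passes the handle straight to the runner.
function runOnSlot(slot: Slot, query: string): string {
  if (slot.connection === null) {
    // Stand-in for the native guard rejecting a value that is not a connection.
    throw new TypeError("connection must be a PostgresSQLConnection");
  }
  return slot.connection.run(query);
}

const ghost: Slot = { state: "connected", connection: null };
const healthy: Slot = { state: "connected", connection: { run: q => `ok: ${q}` } };
const dead: Slot = { state: "closed", connection: null };

// Only the slot marked closed is picked up for retry; the ghost is skipped.
const retried = slotsToRetry([ghost, healthy, dead]);
```

Because the retry filter keys on the state flag alone, a slot that is `connected` with a null handle fails every query and is never reconnected.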

Fix

#onConnected refuses stale transitions:

```
if (this.state !== PooledConnectionState.pending) {
  return;
}
```

The only legitimate callers are the initial connect and #doRetry paths,
both of which set state = pending right before invoking
#startConnection. Anything else is a racing microtask from a
PostgresSQLConnection whose on_close already ran; those have to be
dropped on the floor, not promoted back into readyConnections.
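
The effect of the guard can be checked with a self-contained model. This is a sketch rather than the code in the PR: `GuardedPooledConnection` and `connect` are illustrative names, and the string states stand in for `PooledConnectionState`.

```typescript
// Illustrative model only: a simplified pooled connection with the guard applied.
type PoolState = "pending" | "connected" | "closed";

class GuardedPooledConnection {
  state: PoolState = "pending";
  connection: object | null = {}; // stands in for the native connection handle
  readonly ready = new Set<GuardedPooledConnection>();

  onConnected(): void {
    // Only a slot still waiting on its handshake may be promoted.
    if (this.state !== "pending") return;
    this.state = "connected";
    this.ready.add(this);
  }

  onClose(): void {
    this.connection = null;
    this.state = "closed";
    this.ready.delete(this);
  }
}

async function connect(closeFirst: boolean): Promise<GuardedPooledConnection> {
  const conn = new GuardedPooledConnection();
  queueMicrotask(() => conn.onConnected()); // on_connect is queued in both cases
  if (closeFirst) conn.onClose(); // optionally fail synchronously before the drain
  await Promise.resolve(); // let the queued callback run
  return conn;
}
```

With `connect(true)` the stale callback is dropped and the slot stays `closed`, so a retry path keyed on that state can pick it up. With `connect(false)` the normal promotion to `connected` is unchanged.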

Same race exists in PooledMySQLConnection (MySQL uses the same
state-machine shape and the same on_connect microtask pattern) so the
guard is applied to both adapters.

Verification

test/js/sql/postgres-close-during-handshake.test.ts — fake TCP server
that serves the full trust-mode handshake (AuthOk + ParameterStatus
stack + BackendKeyData + ReadyForQuery + admin-shutdown ErrorResponse)
and FINs every socket. Runs under plain bun bd test — no Docker.
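
For readers unfamiliar with the wire format, the byte sequence such a fake server sends can be sketched as follows. This is not the fixture from the PR: the helper names, the `server_version` value, and the process ID and secret key are placeholders, and only a single ParameterStatus message is shown.

```typescript
const encoder = new TextEncoder();

// Frame a backend message: 1-byte type tag, then a big-endian Int32 length that
// counts itself plus the payload (but not the tag), then the payload.
function frame(tag: string, payload: Uint8Array): Uint8Array {
  const out = new Uint8Array(1 + 4 + payload.length);
  out[0] = tag.charCodeAt(0);
  new DataView(out.buffer).setInt32(1, 4 + payload.length);
  out.set(payload, 5);
  return out;
}

// Encode each string with a trailing NUL, as the protocol requires.
function cstr(...parts: string[]): Uint8Array {
  return encoder.encode(parts.map(part => part + "\0").join(""));
}

// Encode a run of big-endian Int32 values.
function int32(...values: number[]): Uint8Array {
  const out = new Uint8Array(values.length * 4);
  const view = new DataView(out.buffer);
  values.forEach((value, i) => view.setInt32(i * 4, value));
  return out;
}

function concat(chunks: Uint8Array[]): Uint8Array {
  const out = new Uint8Array(chunks.reduce((n, chunk) => n + chunk.length, 0));
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}

const authenticationOk = frame("R", int32(0)); // auth code 0 = AuthenticationOk
const parameterStatus = frame("S", cstr("server_version", "16.0")); // placeholder value
const backendKeyData = frame("K", int32(1234, 5678)); // placeholder pid and secret
const readyForQuery = frame("Z", encoder.encode("I")); // "I" = idle
// ErrorResponse: each field is a type byte plus a C string; a lone NUL ends the list.
const adminShutdown = frame(
  "E",
  concat([
    cstr("SFATAL", "VFATAL", "C57P01", "Mterminating connection due to administrator command"),
    new Uint8Array([0]),
  ]),
);

// Sending everything in one write makes it likely the client reads it in one chunk.
const handshakeThenShutdown = concat([
  authenticationOk,
  parameterStatus,
  backendKeyData,
  readyForQuery,
  adminShutdown,
]);
```

A server would write `handshakeThenShutdown` after receiving the startup message and then end the socket, so ReadyForQuery, the error, and the FIN can arrive close enough together to trigger the race.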

Without the fix, the fixture's first iteration throws
"connection must be a PostgresSQLConnection" and corrupted: true
gets printed; the test expects corrupted: false and fails. With
the fix, the pool keeps reconnecting cleanly through 20 iterations
and the test passes in ~3.5s.

Verified:

- repro script fails ~30% without fix, 0% with fix (release bun, real Postgres)
- fake-server test reliably fails without fix, passes with fix (bun bd test ASAN debug)
- existing test/js/sql/* tests unaffected

…/MySQL When a PooledPostgresConnection or PooledMySQLConnection's socket is closed in the same I/O tick as ReadyForQuery/handshake completion, #onClose runs synchronously before the queued on_connect microtask. The previous #onConnected blindly set state = connected and re-added the dead entry to readyConnections with this.connection === null. Subsequent queries dispatched null to Rust's PostgresSQLQuery.run and failed with 'connection must be a PostgresSQLConnection' forever — the pool never retried a conn it thought was live. Bail out of #onConnected when state !== pending. Same shape for MySQL. Fixes #30947

robobun · 2026-05-17T23:18:33Z

Updated 10:49 PM PT - May 17th, 2026

❌ @robobun, your commit 554c448 has 1 failure in Build #55613 (All Failures):

test/js/bun/util/v8-heap-snapshot.test.ts - SIGKILL on 🐧 25.04 x64

🧪 To try this PR locally:

bunx bun-pr 30950

That installs a local version of the PR into your bun-30950 executable, so you can run:

bun-30950 --bun

coderabbitai · 2026-05-17T23:20:13Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: bd6797ef-0da3-4b65-b0ea-c8174d561579

📥 Commits

Reviewing files that changed from the base of the PR and between 4cb22a7 and e4c0989.

📒 Files selected for processing (1)

test/js/sql/postgres-close-during-handshake.test.ts

Walkthrough

Adds early-return guards in pooled connection onConnected handlers (MySQL and Postgres) to ignore stale callbacks when the connection state is no longer pending, and adds a Postgres regression test that simulates a server close during handshake to validate the fix.

Changes

SQL pool connection state race condition

| Layer / File(s) | Summary |
| --- | --- |
| Connection state guards in handlers: `src/js/internal/sql/mysql.ts`, `src/js/internal/sql/postgres.ts` | `PooledMySQLConnection.#onConnected` and `PooledPostgresConnection.#onConnected` add early-return checks to ensure the pooled connection is still `pending` before processing the ReadyForQuery/on_connect callback; if not `pending`, the callback returns early. |
| Regression test for mid-handshake close: `test/js/sql/postgres-close-during-handshake.test.ts` | Adds a fixture script and test that spawn a fake Postgres server which completes the startup handshake then immediately sends an admin-shutdown and closes. The test runs multiple pooled `SELECT 1` queries and asserts the pool is not corrupted and the fixture exits successfully. |

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: adding guards to ignore stale on_connect callbacks after `#onClose` in pooled Postgres/MySQL connections. |
| Description check | ✅ Passed | The description is comprehensive with clear sections: Repro, Cause, Fix, and Verification. It thoroughly explains the issue, root cause, solution, and testing approach. |
| Linked Issues check | ✅ Passed | The changes directly address issue `#30947` by adding state checks to prevent stale on_connect callbacks from corrupting the connection pool when connections close server-side during handshake. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to fixing the connection pool corruption issue: guards in Postgres/MySQL handlers and a regression test for the specific failure scenario. |

github-actions · 2026-05-17T23:22:15Z

Found 2 issues this PR may fix:

Postgres connection leak #23215 - Unbounded connection growth beyond configured max is consistent with the race where #onConnected re-adds dead connections to readyConnections, causing the pool to create replacements without accounting for corrupted entries
Crash in createInstance in Postgres client in Bun.SQL #24434 - Crash in PostgresSQLConnection__createInstance triggered from a microtask matches the race where a pending #onConnected microtask fires after #onClose has already nulled this.connection

If this is helpful, copy the block below into the PR description to auto-close these issues on merge.

Fixes #23215
Fixes #24434

🤖 Generated with Claude Code

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/js/sql/postgres-close-during-handshake.test.ts`:
- Around line 136-147: Read and assert stderr explicitly before checking payload
state and ensure the process exit code is asserted last: after you obtain
stdout, stderr, and exitCode, add expect(stderr).toBe(""); then parse the last
stdout line into parsed (as already done) and assert expect({ corrupted:
parsed.corrupted }).toEqual({ corrupted: false }); finally assert
expect(exitCode).toEqual(0). Use the existing variables (stdout, stderr,
exitCode, line, parsed) and keep the exit-code assertion as the final check.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e1da5fe9-6dff-41eb-8f5e-2c3cd366a376

📥 Commits

Reviewing files that changed from the base of the PR and between 2edc9e4 and 4cb22a7.

📒 Files selected for processing (1)

test/js/sql/postgres-close-during-handshake.test.ts

claude

Additional findings (outside current diff — PR may have been updated during review):

🟡 test/js/sql/postgres-close-during-handshake.test.ts:135-151 — nit: stderr is collected here but never asserted on or surfaced. If the fixture crashes before printing JSON, the test will fail with {corrupted: undefined, exitCode: <nonzero>} and the actual error will be silently discarded — consider folding stderr into the toEqual object (or logging it on failure) so CI failures are debuggable.
Extended reasoning...

What the issue is

At lines 136-138 the test destructures stderr from Promise.all([proc.stdout.text(), proc.stderr.text(), proc.exited]), but stderr is never referenced again. The final assertion at lines 148-151 only checks { corrupted: parsed.corrupted, exitCode }, so whatever the subprocess wrote to stderr is read into a local and dropped.

How it manifests

The fixture is expected to print exactly one JSON line on stdout and exit 0. If anything goes wrong before that console.log(JSON.stringify(...)) — a panic in the native PostgresSQLConnection, an uncaught exception in the net.createServer setup, an assertion failure in a debug build, or a future refactor that introduces a syntax error — the subprocess will write its diagnostic to stderr and exit non-zero, and stdout will be empty. stdout.trim().split("\n").at(-1) then yields "", parsed becomes {}, and the assertion fails with:
```
expect({ corrupted: undefined, exitCode: 1 }).toEqual({ corrupted: false, exitCode: 0 })
```
That diff tells you that the subprocess crashed, but not why. The actual error (stack trace, panic message, ASAN report) was sitting in stderr and got thrown away.

Why nothing else catches this

The test pipes both stdout and stderr (stdout: "pipe", stderr: "pipe"), so the subprocess's stderr does not inherit to the test runner's terminal — it is only visible if the test code explicitly surfaces it. There is no other assertion or console.error(stderr) path. In CI the only artifact is the jest diff, which contains none of the diagnostic.

Convention

Per test/CLAUDE.md, spawned-process tests in this repo assert on stderr (e.g. expect(stderr).toBe("")) before the exit-code check, or fold it into the snapshot object, precisely so the failure diff carries the real error. The pattern here matches every other field of that convention except this one.

Step-by-step proof
1. Suppose a future change to src/js/internal/sql/postgres.ts regresses such that new SQL({...}) throws synchronously (or the debug build hits an assert).
2. The fixture process writes the uncaught-exception stack to stderr and exits with code 1, never reaching console.log.
3. stdout is "", so line = "", parsed = {}, parsed.corrupted = undefined.
4. stderr contains the full stack trace, but is never read after destructuring.
5. The test fails with { corrupted: undefined, exitCode: 1 } != { corrupted: false, exitCode: 0 }, with no hint that the failure was (say) TypeError: Cannot read properties of undefined rather than the pool-corruption regression this test guards against.

Suggested fix

Fold stderr into the assertion so it appears in the diff:
```
expect({ stderr, corrupted: parsed.corrupted, exitCode }).toEqual({
  stderr: "",
  corrupted: false,
  exitCode: 0,
});
```
If the fixture is expected to emit some stderr (e.g. connection-closed warnings), an alternative is to keep the existing assertion but include stdout/stderr in the object so they show up in the failure diff without being matched against, or simply console.error(stderr) before the expect.

This is purely a debuggability/test-quality concern — it does not affect whether the test passes or fails for the bug being fixed, so filing as a nit.
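
The same idea can be factored into a small helper so the failure diff always carries stderr. This is only a sketch: `summarizeFixtureRun` and `FixtureSummary` are hypothetical names, not code from the test file, and it assumes the fixture prints its JSON result as the last stdout line.

```typescript
interface FixtureSummary {
  stderr: string;
  corrupted: unknown;
  exitCode: number | null;
}

// Collapse a finished subprocess into one object suitable for a single toEqual.
function summarizeFixtureRun(stdout: string, stderr: string, exitCode: number | null): FixtureSummary {
  const lines = stdout.trim().split("\n");
  const line = lines[lines.length - 1] ?? "";
  let parsed: { corrupted?: unknown } = {};
  try {
    // An empty or non-JSON last line leaves parsed as {} instead of throwing.
    const value: unknown = line ? JSON.parse(line) : {};
    if (typeof value === "object" && value !== null) {
      parsed = value as { corrupted?: unknown };
    }
  } catch {
    // Keep parsed empty; stderr in the summary explains what went wrong.
  }
  return { stderr, corrupted: parsed.corrupted, exitCode };
}
```

A test could then compare `summarizeFixtureRun(stdout, stderr, exitCode)` against `{ stderr: "", corrupted: false, exitCode: 0 }`, so a crash before the JSON line shows its diagnostic in the diff.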

claude

The guard looks correct and the race analysis is convincing, but this is a state-machine change in the production SQL connection pool (Postgres + MySQL) with subtle microtask-vs-sync-callback ordering, so it's worth a maintainer's eye — also note Build #55608 is showing failures.

Extended reasoning...

Overview

This PR adds a single early-return guard at the top of #onConnected in both PooledPostgresConnection (src/js/internal/sql/postgres.ts) and PooledMySQLConnection (src/js/internal/sql/mysql.ts): if this.state !== PooledConnectionState.pending, the callback is dropped. It also adds a ~150-line regression test (test/js/sql/postgres-close-during-handshake.test.ts) that spins up a fake Postgres TCP server which sends the full trust-mode handshake plus an admin-shutdown error and immediately FINs, to deterministically reproduce the race where #onClose runs synchronously before the queued #onConnected microtask drains.

Security risks

None. The guard is purely defensive against an internal callback-ordering race; it doesn't touch auth, TLS, query escaping, or any user-input handling. The test fixture binds to 127.0.0.1:0 in a subprocess.

Level of scrutiny

Medium-high. The runtime change is only ~3 lines of logic per adapter, and tracing the state machine confirms the only legitimate transitions into #onConnected are from pending (constructor and #doRetry() both set state = pending before #startConnection()), so the guard cannot drop a valid callback. #onClose already handles error propagation, onFinish resolution, and adapter.release, so dropping a stale #onConnected after it doesn't leak or hang anything. That said, this is the connection-pool state machine for Bun.SQL — a production-critical path where pool corruption manifests as permanent wedging — and the fix hinges on subtle reasoning about native Rust's queue_microtask(on_connect) vs synchronous on_close dispatch ordering. A maintainer familiar with the native side (e.g. the PostgresSQLConnection lifecycle) should confirm there's no path where on_connect is the only signal and on_close never follows.

Other factors

robobun flagged failures in Build #55608 on the latest commit; I can't tell from here whether they're related or pre-existing flakes.
The new test hand-rolls Postgres wire-protocol packets and relies on uSockets delivering handshake+FIN in one poll dispatch; worth a quick check that it's not flaky across platforms.
coderabbit's only comment (stderr assertion) was addressed in e4c0989 and resolved.
No CODEOWNERS cover these files.

github-actions Bot added the claude label May 17, 2026

robobun mentioned this pull request May 17, 2026

Bun.SQL pool permanently corrupted when all pool connections are closed server-side #30947

Open

[autofix.ci] apply automated fixes

4cb22a7

coderabbitai Bot reviewed May 17, 2026

Comment thread test/js/sql/postgres-close-during-handshake.test.ts

claude Bot reviewed May 17, 2026

robobun and others added 2 commits May 17, 2026 23:47

test: surface fixture stderr in assertion diff

e4c0989

ci: retrigger

554c448

claude Bot reviewed May 18, 2026

robobun wants to merge 4 commits into main from farm/f6fb4a56/sql-pool-stale-onconnected
