
sql: ignore stale on_connect firing after #onClose in pooled Postgres/MySQL #30950

Open
robobun wants to merge 4 commits into main from farm/f6fb4a56/sql-pool-stale-onconnected


@robobun
Collaborator

@robobun robobun commented May 17, 2026

Fixes #30947.

Repro

The issue's script, against local Postgres:

$ DATABASE_URL=postgres://… bun bug.ts
[killed] {"n":"10"}
[q0] {"err":"connection must be a PostgresSQLConnection"}
[q1] {"err":"connection must be a PostgresSQLConnection"}
[q2] {"err":"connection must be a PostgresSQLConnection"}
[q3] {"err":"connection must be a PostgresSQLConnection"}
[q4] {"err":"connection must be a PostgresSQLConnection"}

30–40% failure rate on a max: 10 pool when all connections are closed
server-side while the event loop is blocked (the repro blocks via
Bun.spawnSync running pg_terminate_backend; in production this is
a pooler's idle reaper, a network blip, or Supabase's session pooler).
The pool never recovers; sql.close() afterwards spins at 100% CPU.

Cause

PooledPostgresConnection.#onConnected is wired to the native
PostgresSQLConnection's on_connect callback, which Rust schedules
as a microtask (queue_microtask) from set_status(Connected) after
the server sends ReadyForQuery. #onClose is the paired callback,
which fail_with_js_value fires synchronously.

When the server's FIN lands in the same I/O tick as ReadyForQuery
(uSockets' poll returns READABLE | EPOLLHUP in one wake, common on
loopback or when a pooler closes many idle conns in a batch), uSockets
dispatches us_dispatch_data then us_dispatch_end in the same
us_internal_dispatch_ready_poll call:

  1. on_data parses ReadyForQuery, which schedules queue_microtask(on_connect).
    When microtasks drain on event_loop.exit(), #onConnected would run.
  2. But the handshake response also carried the admin-shutdown
    ErrorResponse, so on_data calls fail, which runs
    on_close synchronously. #onClose nulls this.connection,
    sets state = closed, and removes the entry from readyConnections.
  3. The microtask queue has not drained yet, so it still holds the
    on_connect callback from step 1. When it fires, #onConnected blindly
    overwrites state = connected and re-adds the entry to readyConnections,
    but this.connection is still null.
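
The ordering in the three steps above can be reproduced in isolation. The following is a minimal sketch, not Bun's actual pool code: `FakePooledConnection`, `simulateRace`, and the string states are hypothetical stand-ins, and `queueMicrotask` plays the role of the native side's deferred on_connect.

```typescript
// Hypothetical stand-in for a pooled connection; names are illustrative only.
type State = "pending" | "connected" | "closed";

class FakePooledConnection {
  state: State = "pending";
  connection: object | null = {};

  // Unguarded handler, mirroring the pre-fix behavior described above.
  onConnected(ready: Set<FakePooledConnection>): void {
    this.state = "connected";
    ready.add(this);
  }

  // Close handler: drops the native handle and leaves the ready set.
  onClose(ready: Set<FakePooledConnection>): void {
    this.connection = null;
    this.state = "closed";
    ready.delete(this);
  }
}

async function simulateRace(): Promise<{ state: State; hasConnection: boolean; inReady: boolean }> {
  const ready = new Set<FakePooledConnection>();
  const conn = new FakePooledConnection();

  // Step 1: ReadyForQuery is parsed and on_connect is deferred to a microtask.
  queueMicrotask(() => conn.onConnected(ready));
  // Step 2: the close arrives in the same tick and is handled synchronously.
  conn.onClose(ready);
  // Step 3: yield to the event loop so the pending microtask has run.
  await new Promise<void>(resolve => setTimeout(resolve, 0));

  return { state: conn.state, hasConnection: conn.connection !== null, inReady: ready.has(conn) };
}
```

Calling `simulateRace()` resolves with `state` equal to `"connected"`, `hasConnection` equal to `false`, and `inReady` equal to `true`, which is the ghost entry the description refers to.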

The next query finds this ghost because readyConnections.size > 0, so
flushConcurrentQueries dispatches onQueryConnected(null, pooledConn),
and handle.run(pooledConn.connection /* null */, query) trips Rust's
from_js_ref guard:

// src/sql_jsc/postgres/PostgresSQLQuery.rs:536
let Some(connection) = postgres_sql_connection::js::from_js_ref(arguments[0]) else {
    return Err(global_object.throw(format_args!("connection must be a PostgresSQLConnection")));
};

#doRetry only runs for conns with state === closed, so a ghost
in connected is never refreshed — the pool is permanently wedged.
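
Why the pool stays wedged can be shown with a small model. This is a sketch under assumptions: `retryClosed` and `pickReady` are hypothetical simplifications of the retry and dispatch paths, not the functions in Bun's source.

```typescript
type State = "pending" | "connected" | "closed";

interface Entry {
  state: State;
  connection: object | null;
}

// Hypothetical retry sweep: only entries in "closed" are reconnected.
function retryClosed(pool: Entry[]): number {
  let retried = 0;
  for (const entry of pool) {
    if (entry.state === "closed") {
      entry.state = "pending";
      retried++;
    }
  }
  return retried;
}

// Hypothetical dispatch: hand the query the first entry marked connected.
function pickReady(pool: Entry[]): Entry | undefined {
  return pool.find(entry => entry.state === "connected");
}

// A ghost: marked connected, but its native handle is gone.
const ghost: Entry = { state: "connected", connection: null };
const pool: Entry[] = [ghost];

const retried = retryClosed(pool); // the ghost is skipped
const picked = pickReady(pool);    // the ghost is handed to the next query
```

In this model `retried` is 0 and `picked` is the ghost with a null `connection`, so every later query receives the same dead entry.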

Fix

#onConnected refuses stale transitions:

if (this.state !== PooledConnectionState.pending) {
  return;
}

The only legitimate caller is the initial connect / #doRetry path,
both of which set state = pending right before invoking
#startConnection. Anything else is a racing microtask from a
PostgresSQLConnection whose on_close already ran — those have to be
dropped on the floor, not promoted back into readyConnections.
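
A self-contained sketch of the guarded state machine follows. `GuardedPooledConnection`, `startConnection`, and `staleThenRetry` are hypothetical names, and the `PooledConnectionState` object here is an illustrative stand-in for the real one, so treat this as a model of the guard rather than the patched code.

```typescript
// Illustrative stand-in for the real state enum in Bun's SQL internals.
const PooledConnectionState = { pending: 0, connected: 1, closed: 2 } as const;
type PooledConnectionState = (typeof PooledConnectionState)[keyof typeof PooledConnectionState];

class GuardedPooledConnection {
  state: PooledConnectionState = PooledConnectionState.closed;
  connection: object | null = null;
  readonly ready: Set<GuardedPooledConnection>;

  constructor(ready: Set<GuardedPooledConnection>) {
    this.ready = ready;
  }

  // Stand-in for the connect / retry path: state becomes pending first.
  startConnection(): void {
    this.state = PooledConnectionState.pending;
    this.connection = {};
  }

  onConnected(): void {
    // The guard from the fix: drop callbacks that arrive after a close.
    if (this.state !== PooledConnectionState.pending) {
      return;
    }
    this.state = PooledConnectionState.connected;
    this.ready.add(this);
  }

  onClose(): void {
    this.connection = null;
    this.state = PooledConnectionState.closed;
    this.ready.delete(this);
  }
}

function staleThenRetry(): { afterStale: boolean; afterRetry: boolean } {
  const ready = new Set<GuardedPooledConnection>();
  const conn = new GuardedPooledConnection(ready);

  conn.startConnection();
  conn.onClose();     // the close wins the race
  conn.onConnected(); // stale callback: ignored by the guard
  const afterStale = ready.has(conn);

  conn.startConnection(); // the retry path sets pending again
  conn.onConnected();     // legitimate callback: accepted
  const afterRetry = ready.has(conn);

  return { afterStale, afterRetry };
}
```

`staleThenRetry()` returns `afterStale: false` and `afterRetry: true`: the stale callback leaves the ready set untouched, while a callback that follows a fresh pending transition is still honored.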

Same race exists in PooledMySQLConnection (MySQL uses the same
state-machine shape and the same on_connect microtask pattern) so the
guard is applied to both adapters.

Verification

test/js/sql/postgres-close-during-handshake.test.ts — fake TCP server
that serves the full trust-mode handshake (AuthOk + ParameterStatus
stack + BackendKeyData + ReadyForQuery + admin-shutdown ErrorResponse)
and FINs every socket. Runs under plain bun bd test — no Docker.
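
For readers unfamiliar with the wire format, the byte sequence such a fake server writes can be sketched as below. This is not the fixture's code: the helper names, the ParameterStatus values, and the process ID and secret key are assumptions chosen for illustration. The framing (a one-byte tag followed by a big-endian Int32 length that includes itself) and SQLSTATE 57P01 for admin shutdown follow the PostgreSQL protocol.

```typescript
const encoder = new TextEncoder();

// Join byte arrays end to end.
function concat(parts: Uint8Array[]): Uint8Array {
  const total = parts.reduce((n, p) => n + p.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const p of parts) {
    out.set(p, offset);
    offset += p.length;
  }
  return out;
}

// Big-endian 32-bit integer.
function int32(n: number): Uint8Array {
  const out = new Uint8Array(4);
  new DataView(out.buffer).setInt32(0, n, false);
  return out;
}

// NUL-terminated string.
function cstr(s: string): Uint8Array {
  return concat([encoder.encode(s), new Uint8Array([0])]);
}

// Backend message: tag byte, then length (counting itself and the payload).
function frame(tag: string, payload: Uint8Array): Uint8Array {
  return concat([new Uint8Array([tag.charCodeAt(0)]), int32(4 + payload.length), payload]);
}

const authenticationOk = (): Uint8Array => frame("R", int32(0));
const parameterStatus = (name: string, value: string): Uint8Array =>
  frame("S", concat([cstr(name), cstr(value)]));
const backendKeyData = (pid: number, secret: number): Uint8Array =>
  frame("K", concat([int32(pid), int32(secret)]));
const readyForQuery = (): Uint8Array => frame("Z", encoder.encode("I")); // "I" = idle

// ErrorResponse: typed fields (S severity, C SQLSTATE, M message), then a zero byte.
const adminShutdown = (): Uint8Array =>
  frame("E", concat([
    cstr("SFATAL"),
    cstr("C57P01"),
    cstr("Mterminating connection due to administrator command"),
    new Uint8Array([0]),
  ]));

// Everything the fake server would write before closing the socket.
function handshakeThenShutdown(): Uint8Array {
  return concat([
    authenticationOk(),
    parameterStatus("server_version", "16.0"),  // illustrative value
    parameterStatus("client_encoding", "UTF8"), // illustrative value
    backendKeyData(1234, 5678),                 // arbitrary pid and secret
    readyForQuery(),
    adminShutdown(),
  ]);
}
```

Writing the whole buffer in one call and then ending the socket is what makes it likely that the client sees ReadyForQuery, the ErrorResponse, and the FIN in a single wake.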

Without the fix, the fixture's first iteration throws
"connection must be a PostgresSQLConnection" and corrupted: true
gets printed; the test expects corrupted: false and fails. With
the fix, the pool keeps reconnecting cleanly through 20 iterations
and the test passes in ~3.5s.

Verified:

  • repro script fails ~30% without fix, 0% with fix (release bun,
    real Postgres)
  • fake-server test reliably fails without fix, passes with fix
    (bun bd test ASAN debug)
  • existing test/js/sql/* tests unaffected

…/MySQL

When a PooledPostgresConnection or PooledMySQLConnection's socket is closed
in the same I/O tick as ReadyForQuery/handshake completion, #onClose runs
synchronously before the queued on_connect microtask. The previous
#onConnected blindly set state = connected and re-added the dead entry to
readyConnections with this.connection === null. Subsequent queries dispatched
null to Rust's PostgresSQLQuery.run and failed with 'connection must be a
PostgresSQLConnection' forever — the pool never retried a conn it thought
was live.

Bail out of #onConnected when state !== pending. Same shape for MySQL.

Fixes #30947
@robobun
Collaborator Author

robobun commented May 17, 2026

Updated 10:49 PM PT - May 17th, 2026

@robobun, your commit 554c448 has 1 failure in Build #55613 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 30950

That installs a local version of the PR into your bun-30950 executable, so you can run:

bun-30950 --bun

@coderabbitai
Contributor

coderabbitai Bot commented May 17, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: bd6797ef-0da3-4b65-b0ea-c8174d561579

📥 Commits

Reviewing files that changed from the base of the PR and between 4cb22a7 and e4c0989.

📒 Files selected for processing (1)
  • test/js/sql/postgres-close-during-handshake.test.ts

Walkthrough

Adds early-return guards in pooled connection onConnected handlers (MySQL and Postgres) to ignore stale callbacks when the connection state is no longer pending, and adds a Postgres regression test that simulates a server close during handshake to validate the fix.

Changes

SQL pool connection state race condition

| Layer / File(s) | Summary |
| --- | --- |
| Connection state guards in handlers: src/js/internal/sql/mysql.ts, src/js/internal/sql/postgres.ts | PooledMySQLConnection.#onConnected and PooledPostgresConnection.#onConnected add early-return checks to ensure the pooled connection is still pending before processing the ReadyForQuery/on_connect callback; if not pending, the callback returns early. |
| Regression test for mid-handshake close: test/js/sql/postgres-close-during-handshake.test.ts | Adds a fixture script and test that spawn a fake Postgres server which completes the startup handshake then immediately sends an admin-shutdown and closes. The test runs multiple pooled SELECT 1 queries and asserts the pool is not corrupted and the fixture exits successfully. |
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: adding guards to ignore stale on_connect callbacks after #onClose in pooled Postgres/MySQL connections. |
| Description check | ✅ Passed | The description is comprehensive with clear sections: Repro, Cause, Fix, and Verification. It thoroughly explains the issue, root cause, solution, and testing approach. |
| Linked Issues check | ✅ Passed | The changes directly address issue #30947 by adding state checks to prevent stale on_connect callbacks from corrupting the connection pool when connections close server-side during handshake. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to fixing the connection pool corruption issue: guards in Postgres/MySQL handlers and a regression test for the specific failure scenario. |


@github-actions
Contributor

Found 2 issues this PR may fix:

  1. Postgres connection leak #23215 - Unbounded connection growth beyond configured max is consistent with the race where #onConnected re-adds dead connections to readyConnections, causing the pool to create replacements without accounting for corrupted entries
  2. Crash in createInstance in Postgres client in Bun.SQL #24434 - Crash in PostgresSQLConnection__createInstance triggered from a microtask matches the race where a pending #onConnected microtask fires after #onClose has already nulled this.connection

If this is helpful, copy the block below into the PR description to auto-close these issues on merge.

Fixes #23215
Fixes #24434

🤖 Generated with Claude Code

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/js/sql/postgres-close-during-handshake.test.ts`:
- Around line 136-147: Read and assert stderr explicitly before checking payload
state and ensure the process exit code is asserted last: after you obtain
stdout, stderr, and exitCode, add expect(stderr).toBe(""); then parse the last
stdout line into parsed (as already done) and assert expect({ corrupted:
parsed.corrupted }).toEqual({ corrupted: false }); finally assert
expect(exitCode).toEqual(0). Use the existing variables (stdout, stderr,
exitCode, line, parsed) and keep the exit-code assertion as the final check.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e1da5fe9-6dff-41eb-8f5e-2c3cd366a376

📥 Commits

Reviewing files that changed from the base of the PR and between 2edc9e4 and 4cb22a7.

📒 Files selected for processing (1)
  • test/js/sql/postgres-close-during-handshake.test.ts

Comment thread test/js/sql/postgres-close-during-handshake.test.ts
Contributor

@claude claude Bot left a comment


Additional findings (outside current diff — PR may have been updated during review):

  • 🟡 test/js/sql/postgres-close-during-handshake.test.ts:135-151 — nit: stderr is collected here but never asserted on or surfaced. If the fixture crashes before printing JSON, the test will fail with {corrupted: undefined, exitCode: <nonzero>} and the actual error will be silently discarded — consider folding stderr into the toEqual object (or logging it on failure) so CI failures are debuggable.

    Extended reasoning...

    What the issue is

    At lines 136-138 the test destructures stderr from Promise.all([proc.stdout.text(), proc.stderr.text(), proc.exited]), but stderr is never referenced again. The final assertion at lines 148-151 only checks { corrupted: parsed.corrupted, exitCode }, so whatever the subprocess wrote to stderr is read into a local and dropped.

    How it manifests

    The fixture is expected to print exactly one JSON line on stdout and exit 0. If anything goes wrong before that console.log(JSON.stringify(...)) — a panic in the native PostgresSQLConnection, an uncaught exception in the net.createServer setup, an assertion failure in a debug build, or a future refactor that introduces a syntax error — the subprocess will write its diagnostic to stderr and exit non-zero, and stdout will be empty. stdout.trim().split("\n").at(-1) then yields "", parsed becomes {}, and the assertion fails with:

    expect({ corrupted: undefined, exitCode: 1 }).toEqual({ corrupted: false, exitCode: 0 })
    

    That diff tells you that the subprocess crashed, but not why. The actual error (stack trace, panic message, ASAN report) was sitting in stderr and got thrown away.

    Why nothing else catches this

    The test pipes both stdout and stderr (stdout: "pipe", stderr: "pipe"), so the subprocess's stderr does not inherit to the test runner's terminal — it is only visible if the test code explicitly surfaces it. There is no other assertion or console.error(stderr) path. In CI the only artifact is the jest diff, which contains none of the diagnostic.

    Convention

    Per test/CLAUDE.md, spawned-process tests in this repo assert on stderr (e.g. expect(stderr).toBe("")) before the exit-code check, or fold it into the snapshot object, precisely so the failure diff carries the real error. The pattern here matches every other field of that convention except this one.

    Step-by-step proof

    1. Suppose a future change to src/js/internal/sql/postgres.ts regresses such that new SQL({...}) throws synchronously (or the debug build hits an assert).
    2. The fixture process writes the uncaught-exception stack to stderr and exits with code 1, never reaching console.log.
    3. stdout is "", so line = "", parsed = {}, parsed.corrupted = undefined.
    4. stderr contains the full stack trace, but is never read after destructuring.
    5. The test fails with { corrupted: undefined, exitCode: 1 } != { corrupted: false, exitCode: 0 } — no hint that the failure was (say) TypeError: Cannot read properties of undefined rather than the pool-corruption regression this test guards against.

    Suggested fix

    Fold stderr into the assertion so it appears in the diff:

    expect({ stderr, corrupted: parsed.corrupted, exitCode }).toEqual({
      stderr: "",
      corrupted: false,
      exitCode: 0,
    });

    If the fixture is expected to emit some stderr (e.g. connection-closed warnings), an alternative is to keep the existing assertion but include stdout/stderr in the object so they show up in the failure diff without being matched against, or simply console.error(stderr) before the expect.

    This is purely a debuggability/test-quality concern — it does not affect whether the test passes or fails for the bug being fixed, so filing as a nit.
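
The crash scenario from the step-by-step proof can be modeled without spawning a process. This is a sketch only: `parseLastJsonLine` and `summarize` are hypothetical helpers, and the fallback to an empty object is an assumption based on the statement above that parsed becomes {} when stdout is empty.

```typescript
// Hypothetical helper: parse the last stdout line, or {} when there is none.
function parseLastJsonLine(stdout: string): { corrupted?: boolean } {
  const lines = stdout.trim().split("\n");
  const line = lines[lines.length - 1];
  return line === "" ? {} : JSON.parse(line);
}

// Build the object handed to the assertion, with stderr folded in.
function summarize(stdout: string, stderr: string, exitCode: number) {
  const parsed = parseLastJsonLine(stdout);
  return { stderr, corrupted: parsed.corrupted, exitCode };
}

// A fixture that crashed before printing its JSON line (made-up error text).
const crashed = summarize("", "TypeError: example failure\n", 1);

// A healthy run: one JSON line on stdout, empty stderr, exit code 0.
const healthy = summarize('{"corrupted":false}\n', "", 0);
```

With stderr in the compared object, the failing diff for `crashed` carries the error text next to `corrupted: undefined` and `exitCode: 1`, while `healthy` matches the expected `{ stderr: "", corrupted: false, exitCode: 0 }`.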

Contributor

@claude claude Bot left a comment


The guard looks correct and the race analysis is convincing, but this is a state-machine change in the production SQL connection pool (Postgres + MySQL) with subtle microtask-vs-sync-callback ordering, so it's worth a maintainer's eye — also note Build #55608 is showing failures.

Extended reasoning...

Overview

This PR adds a single early-return guard at the top of #onConnected in both PooledPostgresConnection (src/js/internal/sql/postgres.ts) and PooledMySQLConnection (src/js/internal/sql/mysql.ts): if this.state !== PooledConnectionState.pending, the callback is dropped. It also adds a ~150-line regression test (test/js/sql/postgres-close-during-handshake.test.ts) that spins up a fake Postgres TCP server which sends the full trust-mode handshake plus an admin-shutdown error and immediately FINs, to deterministically reproduce the race where #onClose runs synchronously before the queued #onConnected microtask drains.

Security risks

None. The guard is purely defensive against an internal callback-ordering race; it doesn't touch auth, TLS, query escaping, or any user-input handling. The test fixture binds to 127.0.0.1:0 in a subprocess.

Level of scrutiny

Medium-high. The runtime change is only ~3 lines of logic per adapter, and tracing the state machine confirms the only legitimate transitions into #onConnected are from pending (constructor and #doRetry() both set state = pending before #startConnection()), so the guard cannot drop a valid callback. #onClose already handles error propagation, onFinish resolution, and adapter.release, so dropping a stale #onConnected after it doesn't leak or hang anything. That said, this is the connection-pool state machine for Bun.SQL — a production-critical path where pool corruption manifests as permanent wedging — and the fix hinges on subtle reasoning about native Rust's queue_microtask(on_connect) vs synchronous on_close dispatch ordering. A maintainer familiar with the native side (e.g. the PostgresSQLConnection lifecycle) should confirm there's no path where on_connect is the only signal and on_close never follows.

Other factors

  • robobun flagged failures in Build #55608 on the latest commit; I can't tell from here whether they're related or pre-existing flakes.
  • The new test hand-rolls Postgres wire-protocol packets and relies on uSockets delivering handshake+FIN in one poll dispatch; worth a quick check that it's not flaky across platforms.
  • coderabbit's only comment (stderr assertion) was addressed in e4c0989 and resolved.
  • No CODEOWNERS cover these files.



Development

Successfully merging this pull request may close these issues.

Bun.SQL pool permanently corrupted when all pool connections are closed server-side

1 participant