perf: cache thread_id::current() in a #[thread_local] slot #30971

Draft
Jarred-Sumner wants to merge 24 commits into main from claude/bundler-perfile-overhead

@Jarred-Sumner
Collaborator

The bundler's Worker::get(ctx) calls bun_threading::current_thread_id() once per scheduled task to look up the thread's Worker in the pool's assignment map. That routes to bun_core::thread_id::current(), which made a fresh gettid() / pthread_threadid_np() / GetCurrentThreadId() syscall on every call.

A 19K-module bundle (rolldown apps/10000) schedules ~5.7 tasks per module — parse, line-offset table, quoted source contents, compile-result generation, link step 5 — so it paid ~109,000 gettid syscalls vs ~129 in bun-1.3.14. That was ~36% of the build's total syscall time and a ~15-19% wall-clock regression on the benchmark.

Zig's std.Thread.getCurrentId() doesn't have this cost: LinuxThreadImpl reads a threadlocal var tls_thread_id set once at thread start (vendor/zig/lib/std/Thread.zig:841,885). This PR caches the result in a bare #[thread_local] Cell<ThreadId> slot, which uses the same __thread/local-exec TLS model as Zig's threadlocal var and has no LocalKey initialization branch or destructor registration. The slot is filled lazily rather than at spawn so that threads not started through Bun's pool (FFI callbacks, the main thread) still get a valid ID.
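A minimal sketch of the lazy caching pattern, not the PR's actual code. It assumes a stable-Rust stand-in: the thread_local! macro replaces the nightly-only #[thread_local] attribute, and query_os_thread_id is a hypothetical placeholder for the real gettid() / pthread_threadid_np() / GetCurrentThreadId() call that also counts how often it runs. 0 is used as the unset sentinel, as in the commit message.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical stand-in for the OS query. It hands out nonzero IDs and
// counts invocations so the effect of the cache is observable.
static NEXT_ID: AtomicU64 = AtomicU64::new(1);
pub static OS_QUERIES: AtomicU64 = AtomicU64::new(0);

fn query_os_thread_id() -> u64 {
    OS_QUERIES.fetch_add(1, Ordering::Relaxed);
    NEXT_ID.fetch_add(1, Ordering::Relaxed)
}

thread_local! {
    // 0 means "not cached yet"; real thread IDs are assumed to be nonzero.
    static CACHED_ID: Cell<u64> = const { Cell::new(0) };
}

/// Returns this thread's ID, asking the "OS" only on the first call per thread.
pub fn current_thread_id() -> u64 {
    CACHED_ID.with(|slot| {
        let cached = slot.get();
        if cached != 0 {
            return cached; // fast path: a TLS load and a compare
        }
        let id = query_os_thread_id(); // slow path: once per thread
        slot.set(id);
        id
    })
}
```

Because the slot is filled on first use, a thread created outside any pool gets a valid ID the first time it calls current_thread_id(), with no setup at spawn.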

Reproduce:

git clone https://github.com/rolldown/benchmarks
cd benchmarks/apps/10000 && bun add react react-dom @iconify-icons/material-symbols @iconify/react
hyperfine "bun build --outdir=dist-bun --production --sourcemap ./src/index.jsx"

@robobun
Collaborator

robobun commented May 18, 2026

Updated 4:41 AM PT - May 18th, 2026

@Jarred-Sumner, your commit 702135b1406dabf1cb428976d49306e446679b8b passed in Build #55774! 🎉


🧪   To try this PR locally:

bunx bun-pr 30971

That installs a local version of the PR into your bun-30971 executable, so you can run:

bun-30971 --bun

Jarred-Sumner and others added 24 commits May 18, 2026 10:46
The bundler's Worker::get(ctx) calls bun_threading::current_thread_id() once
per scheduled task to look up the thread's Worker in the pool's assignment
map. That routes to bun_core::thread_id::current(), which made a fresh
gettid()/pthread_threadid_np()/GetCurrentThreadId() syscall on every call. A
19K-module bundle (rolldown apps/10000) schedules ~5.7 tasks per module
(parse, line-offset table, quoted source contents, compile-result generation,
link step 5), so it paid ~109K gettid syscalls vs. the Zig version's ~129 -
about 36% of the build's total syscall time.

Zig's std.Thread.getCurrentId() doesn't have this cost: LinuxThreadImpl reads
a threadlocal var tls_thread_id set once at thread start
(vendor/zig/lib/std/Thread.zig:841,885). Cache the result in a bare
#[thread_local] Cell<ThreadId> slot so subsequent calls are a single TLS load
with no LocalKey initialization branch or destructor registration. Lazy rather
than set-at-spawn so threads not started through Bun's pool (FFI callbacks,
the main thread) still get a valid ID; 0 is the unset sentinel since kernel
TIDs and Win32/Darwin thread IDs are nonzero.
…ParseTask

- ParseResult/ParseOptions carry the arena lifetime; cold loader fns take &'a Arena
- ResolveImportRecordCtx/ImportInfo take &[ImportRecord] (allocator-agnostic)
- arena-allocate parser Source so Ast<'bump> isn't pinned to the stack frame
- ArenaVec call sites use std slice/index ops instead of BabyListExt
- Worker::arena() returns &'static (centralises the per-task detach)
…allers

- ParseOptions splits arena lifetime from short-lived input borrows
- DevServer CurrentBundle owns the boxed arena bv2.graph.heap borrows
- JSTranspiler/jsc_hooks reuse the existing per-call arena erasure for ParseOptions.arena
- AsyncModule/js_bundle_completion_task adapt to borrowed Graph.heap
…LinkerGraph::load

Per-file PartList/import_record::List buffers come from per-worker mi_heaps,
which mi_heap_malloc cannot grow from the linker thread. Bitwise-move them
into the linker-thread arena alongside the existing symbol-map copy so
add_part_to_file etc. can append. The parse-side alias keeps the original
handle (slab-freed without element drop, same as before).
…ager re-seat

Replace LinkerGraph::load's reseat_col! (Vec::with_capacity_in + memcpy
for every file's parts/import_records) with bun_alloc::transfer_arena —
swap the ArenaVec's &Arena handle from the per-worker mi_heap to the
bundle-thread heap via ManuallyDrop + from_raw_parts_in. Only files the
linker actually grows pay a (lazy) cross-heap mi_heap_realloc migration.

<&MimallocArena as Allocator>::deallocate is heap-agnostic mi_free, and
grow is mi_heap_realloc_aligned(dst, ptr, ..) — alloc on dst, mi_free old
— so retagging preserves the single-thread-alloc contract while matching
Zig's BabyList.transferOwnership (release no-op there because BabyList is
allocator-erased; Vec<T,&Arena> stores the handle, hence the swap).

Drop the post-step-5 take_ast_ownership call: do_step_5 only pushes to
global-allocator Vecs (Dependency, local_parts_with_uses), never to the
arena-backed PartList/import-record columns.

rolldown apps/10000 (--production --sourcemap, 8 runs):
  wall  520ms -> 501ms   RSS 947MB -> 896MB
  vs bun-1.3.14:  433ms / 647MB
… (thread, pool)

Keyed on a monotonic per-pool generation (not pool address — Bun.build() reuse
makes pointer identity ABA). Drops the workers_assignments lock from the
~100K-per-build hot path to ~nthreads acquisitions; perf attributed ~97% of
the build's futex traffic to the per-call lock on the rolldown 19K-module
benchmark.

Also drops the dead HELP_CATCH_MEMORY_ISSUES blocks in Worker::get/unget and
the stale bumpalo references in this file.
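
A rough sketch of a per-thread cache keyed on a pool generation, under stated assumptions: Pool, worker_index, lock_acquisitions and the single-slot LAST_WORKER cache are hypothetical names for illustration, and the real Worker lookup is more involved. The point it shows is that a monotonic generation number, unlike a pool address, is never reused, so a stale cache entry cannot match a newer pool.

```rust
use std::cell::Cell;
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread::{self, ThreadId};

// Generations start at 1 so that 0 can mean "nothing cached".
static NEXT_GENERATION: AtomicU64 = AtomicU64::new(1);

pub struct Pool {
    generation: u64,
    assignments: Mutex<HashMap<ThreadId, usize>>,
    // Instrumentation for the sketch: how often the slow path took the lock.
    pub lock_acquisitions: AtomicUsize,
}

thread_local! {
    // (pool generation, worker index) of the last pool this thread used.
    static LAST_WORKER: Cell<(u64, usize)> = const { Cell::new((0, 0)) };
}

impl Pool {
    pub fn new() -> Self {
        Pool {
            generation: NEXT_GENERATION.fetch_add(1, Ordering::Relaxed),
            assignments: Mutex::new(HashMap::new()),
            lock_acquisitions: AtomicUsize::new(0),
        }
    }

    /// Returns the calling thread's worker index in this pool.
    pub fn worker_index(&self) -> usize {
        let (cached_generation, cached_index) = LAST_WORKER.with(|c| c.get());
        if cached_generation == self.generation {
            return cached_index; // hot path: no lock, no hashing
        }
        // Slow path: taken once per (thread, pool) while the cache stays warm.
        self.lock_acquisitions.fetch_add(1, Ordering::Relaxed);
        let mut map = self.assignments.lock().unwrap();
        let next = map.len();
        let index = *map.entry(thread::current().id()).or_insert(next);
        drop(map);
        LAST_WORKER.with(|c| c.set((self.generation, index)));
        index
    }
}
```

With a single-slot cache, a thread that alternates between two live pools falls back to the lock each time it switches; that trade-off is acceptable in this sketch because a bundler worker normally serves one pool at a time.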
…u32, _>

source_index keys are dense 0..module_count and this map is probed once per
import inside on_parse_task_complete (the main-thread parse-phase throughput
limiter). Replaces hash+probe with direct index.
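
To illustrate the hash-map-to-direct-index swap, here is a small sketch. DenseMap is a hypothetical type, not the one used in the PR; it only relies on the property stated above, that source_index keys are dense in 0..module_count.

```rust
/// A map from dense u32 keys to values, backed by a Vec instead of a hash table.
pub struct DenseMap<V> {
    slots: Vec<Option<V>>,
}

impl<V> DenseMap<V> {
    /// Pre-sizes the table for keys in 0..len.
    pub fn with_len(len: usize) -> Self {
        let mut slots = Vec::new();
        slots.resize_with(len, || None);
        DenseMap { slots }
    }

    /// Stores `value` under `source_index`, returning the previous value if any.
    pub fn insert(&mut self, source_index: u32, value: V) -> Option<V> {
        let i = source_index as usize;
        if i >= self.slots.len() {
            // Grow on demand so late-discovered modules still fit.
            self.slots.resize_with(i + 1, || None);
        }
        self.slots[i].replace(value)
    }

    /// One bounds check and one load; no hashing and no probing.
    pub fn get(&self, source_index: u32) -> Option<&V> {
        self.slots.get(source_index as usize)?.as_ref()
    }
}
```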
…red_imports

The Zig original used a 4096-byte stack-fallback ArrayHashMap; the Rust
port heap-allocated an ArrayHashMap<u32, ()> per parsed file. Swap to
AutoBitSet sized to file_import_records.len() — it stays in its inline
2-word Static arm for the typical <128-record file and is O(1) word ops
to set/probe instead of hash+probe.
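
The inline-versus-heap split can be sketched as follows. This BitSet is an illustrative stand-in for AutoBitSet, whose real API is not shown in this PR; the 128-bit inline capacity follows from the two 64-bit words mentioned above.

```rust
/// A bit set that stores up to 128 bits inline and spills to the heap beyond that.
pub enum BitSet {
    Static([u64; 2]),
    Dynamic(Vec<u64>),
}

impl BitSet {
    /// Creates an all-zero set able to hold `bits` bits.
    pub fn with_len(bits: usize) -> Self {
        if bits <= 128 {
            BitSet::Static([0; 2]) // no allocation for the common small case
        } else {
            BitSet::Dynamic(vec![0; (bits + 63) / 64])
        }
    }

    fn words(&self) -> &[u64] {
        match self {
            BitSet::Static(w) => &w[..],
            BitSet::Dynamic(v) => v.as_slice(),
        }
    }

    fn words_mut(&mut self) -> &mut [u64] {
        match self {
            BitSet::Static(w) => &mut w[..],
            BitSet::Dynamic(v) => v.as_mut_slice(),
        }
    }

    /// Sets bit `i`; panics if `i` is beyond the capacity chosen at creation.
    pub fn set(&mut self, i: usize) {
        self.words_mut()[i / 64] |= 1u64 << (i % 64);
    }

    /// Tests bit `i` with one shift and one mask.
    pub fn is_set(&self, i: usize) -> bool {
        ((self.words()[i / 64] >> (i % 64)) & 1) == 1
    }
}
```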
…ep-cloning

Zig's Entry.data holds slices/pointers so its by-value return is a
shallow few-word copy. The Rust port made EntryData own boxed
slices/Vecs, so entry.value.clone() and exports.clone() deep-copied the
entire conditions subtree on every resolve. Return Option<&Entry> from
value_for_key and match exports by reference in resolve_exports;
resolve_target already takes &Entry so callers just drop the local
sentinel and pass the borrow through.
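
A simplified sketch of the borrow-instead-of-clone change. Entry and EntryData here are reduced, hypothetical shapes (the resolver's real types carry more variants and fields); only the value_for_key name and its Option<&Entry> return type come from the commit message.

```rust
/// Reduced model of a package.json "exports" node that owns its subtree.
pub enum EntryData {
    Target(String),
    Conditions(Vec<(String, Entry)>),
}

pub struct Entry {
    pub data: EntryData,
}

impl Entry {
    /// Returns a borrow of the child entry for `key`.
    /// Nothing is copied, however large the matched subtree is.
    pub fn value_for_key(&self, key: &str) -> Option<&Entry> {
        match &self.data {
            EntryData::Conditions(list) => list
                .iter()
                .find(|(k, _)| k.as_str() == key)
                .map(|(_, entry)| entry),
            EntryData::Target(_) => None,
        }
    }
}
```

Had value_for_key returned Option<Entry>, each lookup would need to clone the matched subtree, which is the per-resolve deep copy this commit removes.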
wtf/Int128.h dropped its <cassert> include in the latest WebKit bump,
which was the only thing declaring assert() for uv__tty_make_raw() in
the unified build.
…ate on mi_heap_destroy

set_thread_heap() previously bump_reset() unconditionally, so the bundler's
per-task Worker::get → ASTMemoryAllocator::push() abandoned a 16 KB bump chunk
on every task (~70K tasks × 16 KB ≈ 1.1 GB into never-reset worker arenas,
mostly <500 B used per chunk). Now tracks BUMP_HEAP (the chunk's owner) and
keeps the cursor when re-entering that same heap; MimallocArena reset/Drop
calls bump_invalidate_heap() before mi_heap_destroy so a recycled mi_heap_t*
slot can't ABA-match a stale cursor.

rolldown apps/10000 (20K modules): peak RSS 895 → 607 MB, wall 466 → 448 ms.
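
The cursor-retention logic can be modeled with a toy sketch. It tracks offsets only and allocates no memory; BumpCursor, chunks_started and the u64 heap IDs are hypothetical stand-ins for the real bump allocator state and mi_heap_t* pointers. Only the 16 KB chunk size, the keep-the-cursor-for-the-same-owner rule, and the invalidate-before-destroy step come from the commit message.

```rust
const CHUNK_SIZE: usize = 16 * 1024;

/// Toy model of a thread's bump cursor. It tracks offsets only.
pub struct BumpCursor {
    owner: Option<u64>, // heap that owns the current chunk, if any
    used: usize,        // bytes consumed in the current chunk
    pub chunks_started: usize,
}

impl BumpCursor {
    pub fn new() -> Self {
        // `used == CHUNK_SIZE` marks "no usable chunk yet".
        BumpCursor { owner: None, used: CHUNK_SIZE, chunks_started: 0 }
    }

    /// Called on every task entry. Re-entering the same heap keeps the cursor.
    pub fn set_thread_heap(&mut self, heap: u64) {
        if self.owner == Some(heap) {
            return; // same owner: keep filling the current chunk
        }
        self.owner = Some(heap);
        self.used = CHUNK_SIZE; // different heap: next alloc starts a new chunk
    }

    /// Called before a heap is destroyed, so a recycled ID cannot
    /// match a cursor that points into freed memory.
    pub fn invalidate_heap(&mut self, heap: u64) {
        if self.owner == Some(heap) {
            self.owner = None;
            self.used = CHUNK_SIZE;
        }
    }

    /// Returns the offset of the allocation within its chunk.
    /// Assumes `size <= CHUNK_SIZE`.
    pub fn alloc(&mut self, size: usize) -> usize {
        if self.used + size > CHUNK_SIZE {
            self.chunks_started += 1;
            self.used = 0;
        }
        let offset = self.used;
        self.used += size;
        offset
    }
}
```

In this model, 100 tasks that each allocate 400 bytes from the same heap start 3 chunks rather than 100, which mirrors the waste described above when every task abandoned its chunk.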
Jarred-Sumner force-pushed the claude/bundler-perfile-overhead branch from cf0ebee to 702135b on May 18, 2026 10:46

2 participants