perf: cache thread_id::current() in a #[thread_local] slot #30971
Draft
Jarred-Sumner wants to merge 24 commits into
Updated 4:41 AM PT - May 18th, 2026
✅ @Jarred-Sumner, your commit 702135b1406dabf1cb428976d49306e446679b8b passed in
🧪 To try this PR locally: bunx bun-pr 30971
That installs a local version of the PR into your bun-30971 --bun
The bundler's Worker::get(ctx) calls bun_threading::current_thread_id() once per scheduled task to look up the thread's Worker in the pool's assignment map. That routes to bun_core::thread_id::current(), which made a fresh gettid()/pthread_threadid_np()/GetCurrentThreadId() syscall on every call.

A 19K-module bundle (rolldown apps/10000) schedules ~5.7 tasks per module (parse, line-offset table, quoted source contents, compile-result generation, link step 5), so it paid ~109K gettid syscalls vs. the Zig version's ~129, about 36% of the build's total syscall time.

Zig's std.Thread.getCurrentId() doesn't have this cost: LinuxThreadImpl reads a threadlocal var tls_thread_id set once at thread start (vendor/zig/lib/std/Thread.zig:841,885).

Cache the result in a bare #[thread_local] Cell<ThreadId> slot so subsequent calls are a single TLS load with no LocalKey initialization branch or destructor registration. Lazy rather than set-at-spawn so threads not started through Bun's pool (FFI callbacks, the main thread) still get a valid ID; 0 is the unset sentinel since kernel TIDs and Win32/Darwin thread IDs are nonzero.
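A minimal sketch of the lazy, sentinel-based cache described above, not the code in this PR. Assumptions: it uses the stable thread_local! macro with a const initializer instead of the nightly-only #[thread_local] attribute (so it still goes through LocalKey, which the PR avoids), and slow_thread_id(), SLOW_LOOKUPS and NEXT_ID are hypothetical stand-ins for the real OS query so that the number of slow lookups can be observed.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical instrumentation: counts how often the slow path runs.
static SLOW_LOOKUPS: AtomicU64 = AtomicU64::new(0);
// Simulated IDs start at 1, so 0 is never handed out and can mean "unset".
static NEXT_ID: AtomicU64 = AtomicU64::new(1);

/// Hypothetical stand-in for gettid() and friends. It hands out a fresh
/// nonzero ID per call, so the ID is only stable per thread because the
/// cache below calls it once per thread.
fn slow_thread_id() -> u64 {
    SLOW_LOOKUPS.fetch_add(1, Ordering::Relaxed);
    NEXT_ID.fetch_add(1, Ordering::Relaxed)
}

thread_local! {
    // Const-initialized Cell<u64>: no lazy-init closure and no destructor.
    static CACHED_ID: Cell<u64> = const { Cell::new(0) };
}

/// Returns this thread's ID, doing the slow lookup only on the first call.
pub fn current() -> u64 {
    CACHED_ID.with(|slot| {
        let cached = slot.get();
        if cached != 0 {
            return cached; // fast path: one TLS read
        }
        let id = slow_thread_id();
        slot.set(id);
        id
    })
}

/// Number of slow lookups performed so far, across all threads.
pub fn slow_lookups() -> u64 {
    SLOW_LOOKUPS.load(Ordering::Relaxed)
}
```

Because the slot is filled on first use rather than at spawn, a thread that was not created by the pool still gets an ID the first time it calls current().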
…e lexer log lifetime
…p (WIP: ~60 errors)
…ParseTask
- ParseResult/ParseOptions carry the arena lifetime; cold loader fns take &'a Arena
- ResolveImportRecordCtx/ImportInfo take &[ImportRecord] (allocator-agnostic)
- arena-allocate parser Source so Ast<'bump> isn't pinned to the stack frame
- ArenaVec call sites use std slice/index ops instead of BabyListExt
- Worker::arena() returns &'static (centralises the per-task detach)
…allers
- ParseOptions splits arena lifetime from short-lived input borrows
- DevServer CurrentBundle owns the boxed arena bv2.graph.heap borrows
- JSTranspiler/jsc_hooks reuse the existing per-call arena erasure for ParseOptions.arena
- AsyncModule/js_bundle_completion_task adapt to borrowed Graph.heap
…LinkerGraph::load

Per-file PartList/import_record::List buffers come from per-worker mi_heaps, which mi_heap_malloc cannot grow from the linker thread. Bitwise-move them into the linker-thread arena alongside the existing symbol-map copy so add_part_to_file etc. can append. The parse-side alias keeps the original handle (slab-freed without element drop, same as before).
…ager re-seat

Replace LinkerGraph::load's reseat_col! (Vec::with_capacity_in + memcpy for every file's parts/import_records) with bun_alloc::transfer_arena: swap the ArenaVec's &Arena handle from the per-worker mi_heap to the bundle-thread heap via ManuallyDrop + from_raw_parts_in. Only files the linker actually grows pay a (lazy) cross-heap mi_heap_realloc migration.

<&MimallocArena as Allocator>::deallocate is heap-agnostic mi_free, and grow is mi_heap_realloc_aligned(dst, ptr, ..) (alloc on dst, mi_free old), so retagging preserves the single-thread-alloc contract while matching Zig's BabyList.transferOwnership (release no-op there because BabyList is allocator-erased; Vec<T,&Arena> stores the handle, hence the swap).

Drop the post-step-5 take_ast_ownership call: do_step_5 only pushes to global-allocator Vecs (Dependency, local_parts_with_uses), never to the arena-backed PartList/import-record columns.

rolldown apps/10000 (--production --sourcemap, 8 runs):
wall 520ms -> 501ms
RSS 947MB -> 896MB
vs bun-1.3.14: 433ms / 647MB
… (thread, pool)

Keyed on a monotonic per-pool generation (not pool address; Bun.build() reuse makes pointer identity ABA). Drops the workers_assignments lock from the ~100K-per-build hot path to ~nthreads acquisitions; perf attributed ~97% of the build's futex traffic to the per-call lock on the rolldown 19K-module benchmark.

Also drops the dead HELP_CATCH_MEMORY_ISSUES blocks in Worker::get/unget and the stale bumpalo references in this file.
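The generation-keyed cache from this commit could look roughly like the following sketch. Everything here is hypothetical (Pool, worker_index, the single-slot CACHED cell, the lock_acquisitions counter); it only illustrates why a monotonic generation is a safe cache key where a pool address is not: a new pool allocated at a recycled address would match a stale pointer key, while a generation number is never reused.

```rust
use std::cell::Cell;
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread::{self, ThreadId};

// Generations start at 1 and only ever increase; 0 means "nothing cached".
static NEXT_GENERATION: AtomicU64 = AtomicU64::new(1);

thread_local! {
    // Per-thread single-slot cache: (pool generation, worker index).
    static CACHED: Cell<(u64, usize)> = const { Cell::new((0, 0)) };
}

/// Hypothetical pool that assigns each calling thread a worker index.
pub struct Pool {
    generation: u64,
    assignments: Mutex<HashMap<ThreadId, usize>>,
    lock_acquisitions: AtomicUsize, // instrumentation for the example only
}

impl Pool {
    pub fn new() -> Pool {
        Pool {
            generation: NEXT_GENERATION.fetch_add(1, Ordering::Relaxed),
            assignments: Mutex::new(HashMap::new()),
            lock_acquisitions: AtomicUsize::new(0),
        }
    }

    /// Returns the calling thread's worker index, locking only on a cache miss.
    pub fn worker_index(&self) -> usize {
        let (generation, index) = CACHED.with(|c| c.get());
        if generation == self.generation {
            return index; // hot path: no lock
        }
        self.lock_acquisitions.fetch_add(1, Ordering::Relaxed);
        let mut map = self.assignments.lock().unwrap();
        let next = map.len();
        let index = *map.entry(thread::current().id()).or_insert(next);
        drop(map);
        CACHED.with(|c| c.set((self.generation, index)));
        index
    }

    pub fn lock_acquisitions(&self) -> usize {
        self.lock_acquisitions.load(Ordering::Relaxed)
    }
}
```

With this shape, each thread takes the lock once per pool it works for in a row, which is where the drop from one acquisition per task to roughly one per thread would come from.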
…u32, _>

source_index keys are dense 0..module_count and this map is probed once per import inside on_parse_task_complete (the main-thread parse-phase throughput limiter). Replaces hash+probe with direct index.
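To illustrate the hash-map-to-direct-index swap for dense keys, here is a small sketch. DenseMap and its methods are hypothetical names, not the type used in the PR; the assumption is only that keys fall in 0..len, as the commit message says source_index keys do.

```rust
/// Hypothetical map for dense u32 keys in 0..len, backed by a Vec.
/// A lookup is a bounds check plus an index, with no hashing or probing.
pub struct DenseMap<T> {
    slots: Vec<Option<T>>,
}

impl<T> DenseMap<T> {
    /// Creates a map that accepts keys in 0..len.
    pub fn with_len(len: usize) -> Self {
        let mut slots = Vec::new();
        slots.resize_with(len, || None);
        Self { slots }
    }

    /// Inserts a value and returns the previous one, if any.
    /// Panics if `key` is outside 0..len.
    pub fn insert(&mut self, key: u32, value: T) -> Option<T> {
        self.slots[key as usize].replace(value)
    }

    /// Returns a reference to the value for `key`, or None if the key is
    /// unset or out of range.
    pub fn get(&self, key: u32) -> Option<&T> {
        self.slots.get(key as usize)?.as_ref()
    }
}
```

The trade-off is memory proportional to the key range rather than to the number of entries, which is acceptable only because the keys are dense.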
…resolve_without_symlinks
…pping through Ast::empty_in + init
…red_imports

The Zig original used a 4096-byte stack-fallback ArrayHashMap; the Rust port heap-allocated an ArrayHashMap<u32, ()> per parsed file. Swap to AutoBitSet sized to file_import_records.len(): it stays in its inline 2-word Static arm for the typical <128-record file and is O(1) word ops to set/probe instead of hash+probe.
…ep-cloning

Zig's Entry.data holds slices/pointers so its by-value return is a shallow few-word copy. The Rust port made EntryData own boxed slices/Vecs, so entry.value.clone() and exports.clone() deep-copied the entire conditions subtree on every resolve. Return Option<&Entry> from value_for_key and match exports by reference in resolve_exports; resolve_target already takes &Entry so callers just drop the local sentinel and pass the borrow through.
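The borrow-instead-of-clone change can be sketched on a much simplified tree. The Entry enum and lookup_path below are hypothetical and do not reproduce the resolver's real types or its exports-matching rules; value_for_key reuses the name from the commit message only to show the Option<&Entry> shape.

```rust
/// Hypothetical, simplified exports tree that owns its data.
#[derive(Debug, PartialEq)]
pub enum Entry {
    Target(String),
    Conditions(Vec<(String, Entry)>),
}

impl Entry {
    /// Returns a borrow of the child stored under `key`. Returning
    /// Option<&Entry> instead of Option<Entry> avoids deep-copying the subtree.
    pub fn value_for_key(&self, key: &str) -> Option<&Entry> {
        match self {
            Entry::Conditions(pairs) => {
                pairs.iter().find(|(k, _)| k == key).map(|(_, v)| v)
            }
            Entry::Target(_) => None,
        }
    }
}

/// Walks `conditions` from the root by reference and returns the target path,
/// borrowed from the tree, if the walk ends on a Target.
pub fn lookup_path<'a>(mut entry: &'a Entry, conditions: &[&str]) -> Option<&'a str> {
    for condition in conditions {
        entry = entry.value_for_key(condition)?;
    }
    match entry {
        Entry::Target(path) => Some(path.as_str()),
        Entry::Conditions(_) => None,
    }
}
```

The returned &str borrows from the tree, so the whole lookup allocates nothing; the cost is that callers must keep the tree alive for as long as they hold the result.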
wtf/Int128.h dropped its <cassert> include in the latest WebKit bump, which was the only thing declaring assert() for uv__tty_make_raw() in the unified build.
…ate on mi_heap_destroy

set_thread_heap() previously bump_reset() unconditionally, so the bundler's per-task Worker::get → ASTMemoryAllocator::push() abandoned a 16 KB bump chunk on every task (~70K tasks × 16 KB ≈ 1.1 GB into never-reset worker arenas, mostly <500 B used per chunk). Now tracks BUMP_HEAP (the chunk's owner) and keeps the cursor when re-entering that same heap; MimallocArena reset/Drop calls bump_invalidate_heap() before mi_heap_destroy so a recycled mi_heap_t* slot can't ABA-match a stale cursor.

rolldown apps/10000 (20K modules): peak RSS 895 → 607 MB, wall 466 → 448 ms.
cf0ebee to 702135b
The bundler's Worker::get(ctx) calls bun_threading::current_thread_id() once per scheduled task to look up the thread's Worker in the pool's assignment map. That routes to bun_core::thread_id::current(), which made a fresh gettid()/pthread_threadid_np()/GetCurrentThreadId() syscall on every call.

A 19K-module bundle (rolldown apps/10000) schedules ~5.7 tasks per module (parse, line-offset table, quoted source contents, compile-result generation, link step 5), so it paid ~109,000 gettid syscalls vs ~129 in bun-1.3.14. That was ~36% of the build's total syscall time and a ~15-19% wall-clock regression on the benchmark.

Zig's std.Thread.getCurrentId() doesn't have this cost: LinuxThreadImpl reads a threadlocal var tls_thread_id set once at thread start (vendor/zig/lib/std/Thread.zig:841,885). Cache the result in a bare #[thread_local] Cell<ThreadId> slot: same __thread/local-exec TLS model as Zig's threadlocal var, no LocalKey initialization branch or destructor registration. Lazy rather than set-at-spawn so threads not started through Bun's pool (FFI callbacks, the main thread) still get a valid ID.

Reproduce: