parquet: SIMD-accelerate Sbbf probe (AVX2, scalar fallback) by dmatth1 · Pull Request #10011 · apache/arrow-rs

dmatth1 · 2026-05-23T14:16:18Z

Which issue does this PR close?

No tracked issue; opening directly, following the precedent of apache/arrow-go#336, which shipped AVX2/SSE4/NEON SBBF probes in 18.3.0, and paralleling an in-progress [DISCUSS] thread on dev@arrow.apache.org for the C++ port of the same kernel.

Rationale for this change

Sbbf::check / Sbbf::insert are on the hot path of Parquet row-group skipping for every reader downstream of arrow-rs (DataFusion, Databend, InfluxDB / IOx, RisingWave, GreptimeDB). Each 256-bit Parquet block is exactly one AVX2 vector;
the K=8 lane test collapses to one vptest (_mm256_testc_si256). This PR vectorises that loop on x86_64 without changing the algorithm, hash, salts, or wire format. NEON / aarch64 SIMD support is slated for a follow-up PR.
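To make the "one block, one vector, one `vptest`" claim concrete, here is a minimal sketch of such a probe next to a scalar reference. It is not the PR's code: the names `check_scalar`, `check_avx2` and `check` are hypothetical, the block is modelled as a plain `[u32; 8]`, and the salts are the constants from the Parquet split-block Bloom filter specification.

```rust
/// Salt constants from the Parquet split-block Bloom filter specification.
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

/// Scalar reference (hypothetical name): each lane tests the single bit
/// selected by the top 5 bits of `hash * SALT[i]`.
pub fn check_scalar(block: &[u32; 8], hash: u32) -> bool {
    (0..8).all(|i| (block[i] & (1u32 << (hash.wrapping_mul(SALT[i]) >> 27))) != 0)
}

/// Hand-written AVX2 variant (hypothetical name): all 8 lanes in one vector.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn check_avx2(block: &[u32; 8], hash: u32) -> bool {
    use std::arch::x86_64::*;
    unsafe {
        let salts = _mm256_loadu_si256(SALT.as_ptr() as *const __m256i);
        let prod = _mm256_mullo_epi32(_mm256_set1_epi32(hash as i32), salts);
        let shift = _mm256_srli_epi32::<27>(prod);
        // Per-lane variable shift builds the 8-lane mask (one bit per lane).
        let mask = _mm256_sllv_epi32(_mm256_set1_epi32(1), shift);
        let bits = _mm256_loadu_si256(block.as_ptr() as *const __m256i);
        // testc yields 1 iff (!bits & mask) == 0, i.e. every mask bit is set.
        _mm256_testc_si256(bits, mask) != 0
    }
}

/// Runtime dispatch: AVX2 when the CPU has it, scalar otherwise.
pub fn check(block: &[u32; 8], hash: u32) -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was confirmed at runtime just above.
            return unsafe { check_avx2(block, hash) };
        }
    }
    check_scalar(block, hash)
}
```

The sketch uses unaligned loads so it works on any `[u32; 8]`; an aligned block type would allow the aligned load instead.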

What changes are included in this PR?

AVX2 kernel in simd_x86, dispatched via a cached is_x86_feature_detected!("avx2") result (the runtime check folds away when AVX2 is already enabled at compile time, e.g. with -C target-cpu=native on an AVX2 host).
Scalar Block::{check,insert} retained as the production fallback for non-AVX2 x86 / aarch64 / wasm32 / RISC-V / 32-bit / big-endian, and as the correctness reference the AVX2 kernel is diff-tested against.
Block changed from #[repr(transparent)] to #[repr(C, align(32))]. Byte layout unchanged; alignment is asserted at compile time so the AVX2 aligned load/store contract is load-bearing.
parquet/benches/bloom_filter.rs gains bench_check (miss/hit × three cache regimes) and bench_insert exercising the public API.
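The layout change can be illustrated with a small sketch. The `Block` below is a hypothetical stand-in for the crate's type (eight 32-bit lanes), and `all_aligned` is an illustrative helper; the two `const` assertions show one way to make the size and alignment contract fail the build if it ever drifts.

```rust
/// Hypothetical stand-in for the crate's `Block`: eight 32-bit lanes,
/// 32 bytes in size and 32-byte aligned.
#[repr(C, align(32))]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Block(pub [u32; 8]);

// Compile-time layout checks: compilation fails if either one is violated.
const _: () = assert!(std::mem::size_of::<Block>() == 32);
const _: () = assert!(std::mem::align_of::<Block>() == 32);

/// Illustrative helper: confirms every block in a slice starts on a
/// 32-byte boundary, which is what an aligned 256-bit load requires.
pub fn all_aligned(blocks: &[Block]) -> bool {
    blocks
        .iter()
        .all(|b| (b as *const Block as usize) % 32 == 0)
}
```

Because the size equals the alignment, a `Vec<Block>` stays densely packed, so the byte layout of a filter is the same as with an unaligned `[u32; 8]` wrapper.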

Are these changes tested?

Yes. The 31 pre-existing bloom_filter unit tests continue to pass on x86_64 with and without -C target-cpu=native. Two new diff tests — test_simd_{check,insert}_matches_scalar — assert bit-identical AVX2-vs-scalar output across 10K
random (block, hash) pairs each. Benchmark results (Cascade Lake-class Xeon) are in the commit message.
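A differential test of this kind can be sketched as follows. This is not the PR's `test_simd_{check,insert}_matches_scalar`; the harness `count_mismatches`, the tiny `XorShift64` generator and the two probe implementations are assumptions chosen so the example runs without external crates. The second implementation derives `check` from `insert`, which gives an independent reference.

```rust
/// Salt constants from the Parquet split-block Bloom filter specification.
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

/// Sets the one bit per lane selected by `hash`.
pub fn insert(block: &mut [u32; 8], hash: u32) {
    for i in 0..8 {
        block[i] |= 1u32 << (hash.wrapping_mul(SALT[i]) >> 27);
    }
}

/// Implementation under test: lane-by-lane probe.
pub fn check(block: &[u32; 8], hash: u32) -> bool {
    (0..8).all(|i| (block[i] & (1u32 << (hash.wrapping_mul(SALT[i]) >> 27))) != 0)
}

/// Independent reference: the key is "present" iff inserting it changes nothing.
pub fn check_via_insert(block: &[u32; 8], hash: u32) -> bool {
    let mut copy = *block;
    insert(&mut copy, hash);
    copy == *block
}

/// Minimal xorshift generator (illustrative, not cryptographic).
struct XorShift64(u64);

impl XorShift64 {
    fn next_u32(&mut self) -> u32 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 >> 32) as u32
    }
}

/// Runs both implementations on `pairs` random (block, hash) pairs and
/// returns how many times they disagree.
pub fn count_mismatches<A, B>(a: A, b: B, pairs: usize, seed: u64) -> usize
where
    A: Fn(&[u32; 8], u32) -> bool,
    B: Fn(&[u32; 8], u32) -> bool,
{
    let mut rng = XorShift64(seed | 1); // xorshift state must be non-zero
    let mut mismatches = 0;
    for _ in 0..pairs {
        let mut block = [0u32; 8];
        for lane in block.iter_mut() {
            // OR two draws so blocks are dense enough to produce some hits.
            *lane = rng.next_u32() | rng.next_u32();
        }
        let hash = rng.next_u32();
        if a(&block, hash) != b(&block, hash) {
            mismatches += 1;
        }
    }
    mismatches
}
```

Calling `count_mismatches(check, check_via_insert, 10_000, 42)` and asserting the result is zero mirrors the 10K-pair shape described above.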

Are there any user-facing changes?

No. Public API, MSRV, dependencies, and wire format are all unchanged. The only observable effect is faster Sbbf::check / Sbbf::insert on x86_64 hosts with AVX2.

The SIMD kernel was drafted with AI assistance and reviewed line-by-line; correctness is enforced in CI by the diff tests above. cargo fmt --all -- --check and cargo clippy -p parquet --all-targets -- -D warnings are both clean on this branch.

jhorstmann · 2026-05-23T15:33:17Z

It looks to me like with one small change to the check function, autovectorization can already generate the same instruction sequences: https://rust.godbolt.org/z/a3zT5Gezr
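For readers without the Compiler Explorer link open, the kind of change being discussed can be sketched like this. The code is an assumption about the general shape, not a copy of the linked snippet or of the crate's source: a short-circuiting probe next to a branchless OR-accumulator that computes the same answer. Whether a given compiler version actually vectorizes the second form has to be confirmed by inspecting the generated assembly.

```rust
/// Salt constants from the Parquet split-block Bloom filter specification.
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

/// Builds the 8-lane mask: exactly one bit set per lane.
#[inline]
pub fn mask(hash: u32) -> [u32; 8] {
    let mut m = [0u32; 8];
    for i in 0..8 {
        m[i] = 1u32 << (hash.wrapping_mul(SALT[i]) >> 27);
    }
    m
}

/// Short-circuit form: `.all()` may exit after the first missing bit,
/// which introduces a data-dependent branch per lane.
pub fn check_short_circuit(block: &[u32; 8], hash: u32) -> bool {
    let m = mask(hash);
    block.iter().zip(m.iter()).all(|(b, m)| (b & m) != 0)
}

/// Branchless form: accumulate every mask bit that is missing from the
/// block, then compare once at the end.
pub fn check_branchless(block: &[u32; 8], hash: u32) -> bool {
    let m = mask(hash);
    let mut missing = 0u32;
    for i in 0..8 {
        missing |= !block[i] & m[i];
    }
    missing == 0
}
```

The two forms agree because each mask lane has a single bit: `block[i] & m[i] != 0` holds exactly when `!block[i] & m[i] == 0`.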

@jhorstmann

… shim

Alternative to the hand-written AVX2 intrinsics, per @jhorstmann's review on apache#10011: there are no `_mm256_*` intrinsics here. The single probe implementation lives in `Block::{check,insert}`, written in the vectorizer-friendly shape, and a thin `#[target_feature(enable = "avx2")]` shim (`simd_x86::sbbf_{check,insert}_hash`) calls it. Because the shim is compiled with AVX2 enabled, LLVM autovectorizes the plain Rust body to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest` on a baseline `x86-64` build, with no downstream `target-cpu` flag. The shim is reached only after a runtime `is_x86_feature_detected!` check (cached on `Sbbf`); on the scalar fallback path the same source compiles to SSE2.

Two details are load-bearing for the autovectorizer:

- `Block::check` is the branchless integer OR-accumulator `acc |= !block[i] & mask[i]; acc == 0` (the "testc" reduction shape), not a short-circuiting `.all()`. The short-circuit form defeats vectorization; a bool-`&=` form fails to vectorize through the target_feature shim on a baseline build.
- `Block::mask` is `#[inline]` so it folds into the shim and is vectorized with it rather than staying a scalar call.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the autovectorized 256-bit load/store hits one cache line.

A/B vs the scalar fallback through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"` (removed before this commit).

x86_64, Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build` (no `target-cpu`):

| Regime    | Path   | Scalar | Autovec (tf-shim) | Speedup |
|-----------|--------|-------:|------------------:|--------:|
| S 128 KiB | miss   | 13.02  | 4.96              | 2.62x   |
| S 128 KiB | hit    | 13.47  | 4.95              | 2.72x   |
| S 128 KiB | insert | 11.62  | 5.41              | 2.15x   |
| M 2 MiB   | miss   | 18.88  | 7.47              | 2.53x   |
| M 2 MiB   | hit    | 18.12  | 7.22              | 2.51x   |
| M 2 MiB   | insert | 14.99  | 8.45              | 1.77x   |
| L 32 MiB  | miss   | 27.56  | 11.07             | 2.49x   |
| L 32 MiB  | hit    | 26.57  | 11.23             | 2.37x   |
| L 32 MiB  | insert | 23.53  | 12.77             | 1.84x   |

Tests: the two `test_simd_*_matches_scalar` diff tests assert the AVX2-compiled shim and the baseline-compiled scalar path produce identical output across 10K random `(blocks, hash)` pairs each (guarding against an autovectorizer miscompile). All 35 bloom_filter tests pass with and without `-C target-cpu=native`.
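The shim pattern described in this commit message can be sketched as below. The names `check_body`, `check_avx2_shim` and `Probe` are hypothetical, and the block is a plain `[u32; 8]`; the point is only the structure: one plain-Rust body, a `#[target_feature]` wrapper that lets the compiler use AVX2 when inlining that body, and a detection result cached once. The sketch does not guarantee that any particular compiler version vectorizes the body.

```rust
/// Salt constants from the Parquet split-block Bloom filter specification.
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

/// Single probe implementation in plain Rust (branchless OR-accumulator).
/// `#[inline]` lets it be inlined into the shim and compiled with the
/// shim's target features.
#[inline]
fn check_body(block: &[u32; 8], hash: u32) -> bool {
    let mut missing = 0u32;
    for i in 0..8 {
        let bit = 1u32 << (hash.wrapping_mul(SALT[i]) >> 27);
        missing |= !block[i] & bit;
    }
    missing == 0
}

/// Thin shim: no intrinsics, it only changes which instructions the
/// compiler may use for the inlined body.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn check_avx2_shim(block: &[u32; 8], hash: u32) -> bool {
    check_body(block, hash)
}

/// Hypothetical holder for the cached feature-detection result.
pub struct Probe {
    use_avx2: bool,
}

impl Probe {
    pub fn new() -> Self {
        #[cfg(target_arch = "x86_64")]
        let use_avx2 = is_x86_feature_detected!("avx2");
        #[cfg(not(target_arch = "x86_64"))]
        let use_avx2 = false;
        Probe { use_avx2 }
    }

    pub fn check(&self, block: &[u32; 8], hash: u32) -> bool {
        #[cfg(target_arch = "x86_64")]
        {
            if self.use_avx2 {
                // SAFETY: `use_avx2` is only true when AVX2 was detected.
                return unsafe { check_avx2_shim(block, hash) };
            }
        }
        check_body(block, hash)
    }
}
```

The trade-off is visible in the sketch: the shim needs `unsafe`, a cached flag, and a branch on every call, in exchange for AVX2 code in binaries built without a `target-cpu` flag.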

@jhorstmann

… shim

Alternative to the hand-written AVX2 intrinsics, per @jhorstmann's review on apache#10011: there are no `_mm256_*` intrinsics here. The single probe implementation lives in `Block::{check,insert}`, written in the vectorizer-friendly shape, and a thin `#[target_feature(enable = "avx2")]` shim (`avx2::{check,insert}_hash`) calls it. Because the shim is compiled with AVX2 enabled, LLVM autovectorizes the plain Rust body to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest` on a baseline `x86-64` build, with no downstream `target-cpu` flag. The shim is reached only after a runtime `is_x86_feature_detected!` check (cached on `Sbbf`); on the scalar fallback path the same source compiles to SSE2.

Two details are load-bearing for the autovectorizer:

- `Block::check` is the branchless integer OR-accumulator `acc |= !block[i] & mask[i]; acc == 0` (the "testc" reduction shape), not a short-circuiting `.all()`. The short-circuit form defeats vectorization; a bool-`&=` form fails to vectorize through the target_feature shim on a baseline build.
- `Block::mask` is `#[inline]` so it folds into the shim and is vectorized with it rather than staying a scalar call.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the autovectorized 256-bit load/store hits one cache line.

A/B vs the scalar fallback through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"` (removed before this commit).

x86_64, Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build` (no `target-cpu`):

| Regime    | Path   | Scalar | Autovec (avx2 shim) | Speedup |
|-----------|--------|-------:|--------------------:|--------:|
| S 128 KiB | miss   | 13.02  | 4.96                | 2.62x   |
| S 128 KiB | hit    | 13.47  | 4.95                | 2.72x   |
| S 128 KiB | insert | 11.62  | 5.41                | 2.15x   |
| M 2 MiB   | miss   | 18.88  | 7.47                | 2.53x   |
| M 2 MiB   | hit    | 18.12  | 7.22                | 2.51x   |
| M 2 MiB   | insert | 14.99  | 8.45                | 1.77x   |
| L 32 MiB  | miss   | 27.56  | 11.07               | 2.49x   |
| L 32 MiB  | hit    | 26.57  | 11.23               | 2.37x   |
| L 32 MiB  | insert | 23.53  | 12.77               | 1.84x   |

Tests: the two `test_simd_*_matches_scalar` diff tests assert the AVX2-compiled shim and the baseline-compiled scalar path produce identical output across 10K random `(blocks, hash)` pairs each (guarding against an autovectorizer miscompile). All 35 bloom_filter tests pass with and without `-C target-cpu=native`.

dmatth1 · 2026-05-23T17:08:21Z

Great callout. I benchmarked it, and the numbers with autovectorization are better:

Same-host, same-session medians (Cascade Lake-class Xeon @ 2.8 GHz), via the public Sbbf::{check,insert} API (XXH64 + probe), criterion default profile, ns/op:

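| Regime    | Path   | Scalar | Autovec (this) | Hand-written AVX2 | Autovec vs scalar | Autovec vs hand-written |
|-----------|--------|-------:|---------------:|------------------:|------------------:|------------------------:|
| S 128 KiB | miss   | 13.02  | 4.96           | 5.14              | 2.62×             | +4%                     |
| S 128 KiB | hit    | 13.47  | 4.95           | 5.20              | 2.72×             | +5%                     |
| S 128 KiB | insert | 11.62  | 5.41           | 5.38              | 2.15×             | tied                    |
| M 2 MiB   | miss   | 18.88  | 7.47           | 8.18              | 2.53×             | +9%                     |
| M 2 MiB   | hit    | 18.12  | 7.22           | 8.01              | 2.51×             | +11%                    |
| M 2 MiB   | insert | 14.99  | 8.45           | 8.59              | 1.77×             | tied                    |
| L 32 MiB  | miss   | 27.56  | 11.07          | 13.47             | 2.49×             | +18%                    |
| L 32 MiB  | hit    | 26.57  | 11.23          | 13.40             | 2.37×             | +16%                    |
| L 32 MiB  | insert | 23.53  | 12.77          | 12.60             | 1.84×             | tied                    |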

Changes here: main...dmatth1:arrow-rs:sbbf-autovec-tf
This also unlocks NEON on aarch64, so I will bundle those numbers into the next revision.

@jhorstmann

… shim

Per @jhorstmann's review on apache#10011: no `_mm256_*` intrinsics. The single probe implementation lives in `Block::{check,insert}` and a thin `#[target_feature(enable = "avx2")]` shim calls into it. Because the shim is compiled with AVX2 on, LLVM autovectorizes the plain Rust body to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest` on a baseline `x86-64` build, no `target-cpu` flag required. The shim is reached after a runtime `is_x86_feature_detected!` check (cached on `Sbbf`).

Two preconditions for autovec:

- `Block::check` is the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape); a short-circuiting `.all()` defeats vectorization.
- `Block::mask` is `#[inline]` so it folds into the shim.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the 256-bit load/store hits one cache line.

The same branchless `Block::check` also autovectorizes to NEON on aarch64: no shim, no `target_feature` needed (NEON is baseline). On main, the short-circuit form left aarch64 fully scalar.

A/B vs the scalar fallback through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"` (removed before this commit).

x86_64, Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build`:

| Regime    | Path   | Scalar | Autovec (avx2 shim) | Speedup |
|-----------|--------|-------:|--------------------:|--------:|
| S 128 KiB | miss   | 13.02  | 4.96                | 2.62x   |
| S 128 KiB | hit    | 13.47  | 4.95                | 2.72x   |
| S 128 KiB | insert | 11.62  | 5.41                | 2.15x   |
| M 2 MiB   | miss   | 18.88  | 7.47                | 2.53x   |
| M 2 MiB   | hit    | 18.12  | 7.22                | 2.51x   |
| M 2 MiB   | insert | 14.99  | 8.45                | 1.77x   |
| L 32 MiB  | miss   | 27.56  | 11.07               | 2.49x   |
| L 32 MiB  | hit    | 26.57  | 11.23               | 2.37x   |
| L 32 MiB  | insert | 23.53  | 12.77               | 1.84x   |

aarch64, Apple Silicon M1:

| Regime    | Path   | Scalar | Autovec (NEON) | Speedup |
|-----------|--------|-------:|---------------:|--------:|
| S 128 KiB | miss   | 4.61   | 3.24           | 1.42x   |
| S 128 KiB | hit    | 6.84   | 3.17           | 2.16x   |
| S 128 KiB | insert | 3.25   | 3.19           | 1.02x   |
| M 2 MiB   | miss   | 5.20   | 3.24           | 1.61x   |
| M 2 MiB   | hit    | 7.16   | 3.26           | 2.20x   |
| M 2 MiB   | insert | 3.34   | 3.31           | 1.01x   |
| L 32 MiB  | miss   | 6.66   | 5.42           | 1.23x   |
| L 32 MiB  | hit    | 9.72   | 5.25           | 1.85x   |
| L 32 MiB  | insert | 5.19   | 5.38           | 0.96x   |

Insert is ~tied on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`, where the branchless form unlocks NEON autovec.

Tests: `test_simd_{check,insert}_matches_scalar` diff the AVX2 shim against the baseline-compiled scalar across 10K random pairs; `test_check_matches_reference_aarch64` diffs the autovec'd check against an inline short-circuit reference for the aarch64 codegen path. All bloom_filter tests pass with and without `-C target-cpu=native`.

@jhorstmann

Per @jhorstmann's review on apache#10011: no hand-written `_mm256_*` / NEON intrinsics, no runtime dispatch, no `target_feature` shim. `Block::check` is rewritten in the vectorizer-friendly branchless shape and LLVM autovectorizes it directly to whatever SIMD ISA is enabled at compile time:

- aarch64 (Apple Silicon, Graviton 2/3/4, Ampere, Cobalt): NEON is mandatory baseline, so the default build autovectorizes to `vmulq + vshrq + vshlq + vbicq + vorrq + vmaxvq`.
- x86_64 with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`): autovectorizes to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest`.
- Default `cargo build` on x86_64 (baseline `x86-64`, SSE2 only): partial SSE2 autovec; `vpsllvd` doesn't exist pre-AVX2, so the per-lane variable shift in the mask compute partly scalarizes.
- wasm32, RISC-V, 32-bit: whatever the toolchain's target features allow; falls back to scalar otherwise.

Production deployments that care about x86 SBBF perf should set `RUSTFLAGS="-C target-cpu=x86-64-v3"` (or higher). This is already the convention for analytical Rust binaries (Polars, DataFusion, Databend distros).

A runtime AVX2-detect shim was prototyped and rejected for this PR: it adds `unsafe`, a per-`Sbbf` cached bool, and a dispatch branch in the hot path, in exchange for AVX2 codegen on default-built binaries running on AVX2 hardware. The simplification was preferred.

Two preconditions for autovec:

- `Block::check` is the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape); a short-circuiting `.all()` defeats vectorization.
- `Block::mask` is `#[inline]` so it folds into the call site.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the 256-bit load/store hits one cache line.

A/B vs scalar (short-circuit `Block::check`) through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op.

x86_64, Cascade Lake-class Xeon @ 2.8 GHz, built with `-C target-cpu=x86-64-v3`:

| Regime    | Path   | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss   | 13.02  | 4.96    | 2.62x   |
| S 128 KiB | hit    | 13.47  | 4.95    | 2.72x   |
| S 128 KiB | insert | 11.62  | 5.41    | 2.15x   |
| M 2 MiB   | miss   | 18.88  | 7.47    | 2.53x   |
| M 2 MiB   | hit    | 18.12  | 7.22    | 2.51x   |
| M 2 MiB   | insert | 14.99  | 8.45    | 1.77x   |
| L 32 MiB  | miss   | 27.56  | 11.07   | 2.49x   |
| L 32 MiB  | hit    | 26.57  | 11.23   | 2.37x   |
| L 32 MiB  | insert | 23.53  | 12.77   | 1.84x   |

aarch64, Apple Silicon M1 (NEON via baseline autovec, default build):

| Regime    | Path   | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss   | 4.61   | 3.24    | 1.42x   |
| S 128 KiB | hit    | 6.84   | 3.17    | 2.16x   |
| S 128 KiB | insert | 3.25   | 3.19    | 1.02x   |
| M 2 MiB   | miss   | 5.20   | 3.24    | 1.61x   |
| M 2 MiB   | hit    | 7.16   | 3.26    | 2.20x   |
| M 2 MiB   | insert | 3.34   | 3.31    | 1.01x   |
| L 32 MiB  | miss   | 6.66   | 5.42    | 1.23x   |
| L 32 MiB  | hit    | 9.72   | 5.25    | 1.85x   |
| L 32 MiB  | insert | 5.19   | 5.38    | 0.96x   |

Insert is ~tied on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`, where the branchless form unlocks NEON autovec.

Tests: `test_check_matches_reference` diffs the autovec'd `Block::check` against an inline short-circuit reference across 10K random pairs on every target the crate is built for. All bloom_filter tests pass.
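Since this approach selects the SIMD ISA at compile time, it can help to check which features a given build actually had enabled. The helper below is a hypothetical diagnostic, not part of the PR; it uses the standard `cfg!` macro, which is evaluated when the crate is compiled rather than on the machine where it runs.

```rust
/// Reports which of the cases above this binary was compiled for.
/// Purely illustrative: the strings and the function name are made up.
pub fn compiled_simd_level() -> &'static str {
    if cfg!(all(target_arch = "x86_64", target_feature = "avx2")) {
        // e.g. built with -C target-cpu=x86-64-v3, =native on an AVX2 host,
        // or -C target-feature=+avx2
        "x86_64 with AVX2"
    } else if cfg!(target_arch = "x86_64") {
        // default build: SSE2 is the x86_64 baseline
        "x86_64 baseline (SSE2)"
    } else if cfg!(all(target_arch = "aarch64", target_feature = "neon")) {
        "aarch64 with NEON"
    } else {
        "other target"
    }
}
```

Printing this value from a binary built with and without `RUSTFLAGS="-C target-cpu=x86-64-v3"` is a quick way to confirm which table row a deployment should expect to match.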

@jhorstmann

Per @jhorstmann's review on apache#10011: no hand-written intrinsics, no target_feature shim, no runtime dispatch. `Block::check` is rewritten as the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape) and LLVM autovectorizes it directly to NEON on aarch64 and to AVX2 on x86_64 built with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`).

A runtime AVX2-detect shim was prototyped and rejected: the simplification (no `unsafe`, no `Sbbf` field, no hot-path branch) beat the only thing it bought, which was AVX2 codegen for default-built binaries on AVX2 hardware; production deployments that care already set the target-cpu flag.

Preconditions: `Block::mask` is `#[inline]` (folds into the call site) and `Block` is `#[repr(C, align(32))]` with size/align asserted (so the 256-bit load/store hits one cache line).

A/B vs scalar (short-circuit `Block::check`) through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op.

x86_64, Cascade Lake-class Xeon @ 2.8 GHz, `-C target-cpu=x86-64-v3`:

| Regime    | Path   | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss   | 13.02  | 4.96    | 2.62x   |
| S 128 KiB | hit    | 13.47  | 4.95    | 2.72x   |
| S 128 KiB | insert | 11.62  | 5.41    | 2.15x   |
| M 2 MiB   | miss   | 18.88  | 7.47    | 2.53x   |
| M 2 MiB   | hit    | 18.12  | 7.22    | 2.51x   |
| M 2 MiB   | insert | 14.99  | 8.45    | 1.77x   |
| L 32 MiB  | miss   | 27.56  | 11.07   | 2.49x   |
| L 32 MiB  | hit    | 26.57  | 11.23   | 2.37x   |
| L 32 MiB  | insert | 23.53  | 12.77   | 1.84x   |

aarch64, Apple Silicon M1 (NEON via baseline autovec):

| Regime    | Path   | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss   | 4.61   | 3.24    | 1.42x   |
| S 128 KiB | hit    | 6.84   | 3.17    | 2.16x   |
| S 128 KiB | insert | 3.25   | 3.19    | 1.02x   |
| M 2 MiB   | miss   | 5.20   | 3.24    | 1.61x   |
| M 2 MiB   | hit    | 7.16   | 3.26    | 2.20x   |
| M 2 MiB   | insert | 3.34   | 3.31    | 1.01x   |
| L 32 MiB  | miss   | 6.66   | 5.42    | 1.23x   |
| L 32 MiB  | hit    | 9.72   | 5.25    | 1.85x   |
| L 32 MiB  | insert | 5.19   | 5.38    | 0.96x   |

Insert ties on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`.

Tests: `test_check_matches_reference` diffs the autovec'd `Block::check` against an inline short-circuit reference across 10K random pairs, on every target. All bloom_filter tests pass.

@jhorstmann

`Sbbf::{check,insert}` are on the hot path of Parquet row-group skipping for every reader downstream of `arrow-rs` (DataFusion, Databend, InfluxDB / IOx, RisingWave, GreptimeDB). Each 256-bit Parquet block is exactly one AVX2 vector / two NEON `uint32x4_t` halves; the K=8 lane test is a one-instruction `vptest` on AVX2 and an equivalent SIMD reduce on NEON. This PR vectorises the probe without changing the algorithm, hash, salts, or wire format.

Per @jhorstmann's review on apache#10011: no hand-written intrinsics, no target_feature shim, no runtime dispatch. `Block::check` is rewritten as the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape) and LLVM autovectorizes it directly to NEON on aarch64 and to AVX2 on x86_64 built with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`).

A runtime AVX2-detect shim was prototyped and rejected: the simplification (no `unsafe`, no `Sbbf` field, no hot-path branch) beat the only thing it bought, which was AVX2 codegen for default-built binaries on AVX2 hardware; production deployments that care already set the target-cpu flag.

Preconditions: `Block::mask` is `#[inline]` (folds into the call site) and `Block` is `#[repr(C, align(32))]` with size/align asserted (so the 256-bit load/store hits one cache line).

A/B vs scalar (short-circuit `Block::check`) through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op.

x86_64, Cascade Lake-class Xeon @ 2.8 GHz, `-C target-cpu=x86-64-v3`:

| Regime    | Path   | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss   | 13.02  | 4.96    | 2.62x   |
| S 128 KiB | hit    | 13.47  | 4.95    | 2.72x   |
| S 128 KiB | insert | 11.62  | 5.41    | 2.15x   |
| M 2 MiB   | miss   | 18.88  | 7.47    | 2.53x   |
| M 2 MiB   | hit    | 18.12  | 7.22    | 2.51x   |
| M 2 MiB   | insert | 14.99  | 8.45    | 1.77x   |
| L 32 MiB  | miss   | 27.56  | 11.07   | 2.49x   |
| L 32 MiB  | hit    | 26.57  | 11.23   | 2.37x   |
| L 32 MiB  | insert | 23.53  | 12.77   | 1.84x   |

aarch64, Apple Silicon M1 (NEON via baseline autovec):

| Regime    | Path   | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss   | 4.61   | 3.24    | 1.42x   |
| S 128 KiB | hit    | 6.84   | 3.17    | 2.16x   |
| S 128 KiB | insert | 3.25   | 3.19    | 1.02x   |
| M 2 MiB   | miss   | 5.20   | 3.24    | 1.61x   |
| M 2 MiB   | hit    | 7.16   | 3.26    | 2.20x   |
| M 2 MiB   | insert | 3.34   | 3.31    | 1.01x   |
| L 32 MiB  | miss   | 6.66   | 5.42    | 1.23x   |
| L 32 MiB  | hit    | 9.72   | 5.25    | 1.85x   |
| L 32 MiB  | insert | 5.19   | 5.38    | 0.96x   |

Insert ties on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`.

Tests: `test_check_matches_reference` diffs the autovec'd `Block::check` against an inline short-circuit reference across 10K random pairs, on every target. All bloom_filter tests pass.

@jhorstmann

Each 256-bit Parquet block is exactly one AVX2 vector; the K=8 lane test collapses to one `vptest` (`_mm256_testc_si256`). This PR vectorises that loop without changing the algorithm, hash, salts, or wire format.

Per @jhorstmann's review on apache#10011: `Block::check` is rewritten in the vectorizer-friendly branchless shape and LLVM autovectorizes it directly to whatever SIMD ISA is enabled at compile time:

- aarch64 (Apple Silicon, Graviton 2/3/4, Ampere, Cobalt): NEON is mandatory baseline, so the default build autovectorizes to `vmulq + vshrq + vshlq + vbicq + vorrq + vmaxvq`.
- x86_64 with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`): autovectorizes to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest`.
- Default `cargo build` on x86_64 (baseline `x86-64`, SSE2 only): partial SSE2 autovec; `vpsllvd` doesn't exist pre-AVX2, so the per-lane variable shift in the mask compute partly scalarizes.
- wasm32, RISC-V, 32-bit: whatever the toolchain's target features allow; falls back to scalar otherwise.

Production deployments that care about x86 SBBF perf should set `RUSTFLAGS="-C target-cpu=x86-64-v3"` (or higher). A runtime AVX2-detect shim was prototyped but I prefer this simplification.

Two preconditions for autovec:

- `Block::check` is the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape); a short-circuiting `.all()` defeats vectorization.
- `Block::mask` is `#[inline]` so it folds into the call site.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the 256-bit load/store hits one cache line.

A/B vs scalar (short-circuit `Block::check`) through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op.

x86_64, Cascade Lake-class Xeon @ 2.8 GHz, built with `-C target-cpu=x86-64-v3`:

| Regime    | Path   | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss   | 13.02  | 4.96    | 2.62x   |
| S 128 KiB | hit    | 13.47  | 4.95    | 2.72x   |
| S 128 KiB | insert | 11.62  | 5.41    | 2.15x   |
| M 2 MiB   | miss   | 18.88  | 7.47    | 2.53x   |
| M 2 MiB   | hit    | 18.12  | 7.22    | 2.51x   |
| M 2 MiB   | insert | 14.99  | 8.45    | 1.77x   |
| L 32 MiB  | miss   | 27.56  | 11.07   | 2.49x   |
| L 32 MiB  | hit    | 26.57  | 11.23   | 2.37x   |
| L 32 MiB  | insert | 23.53  | 12.77   | 1.84x   |

aarch64, Apple Silicon M1 (NEON via baseline autovec, default build):

| Regime    | Path   | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss   | 4.61   | 3.24    | 1.42x   |
| S 128 KiB | hit    | 6.84   | 3.17    | 2.16x   |
| S 128 KiB | insert | 3.25   | 3.19    | 1.02x   |
| M 2 MiB   | miss   | 5.20   | 3.24    | 1.61x   |
| M 2 MiB   | hit    | 7.16   | 3.26    | 2.20x   |
| M 2 MiB   | insert | 3.34   | 3.31    | 1.01x   |
| L 32 MiB  | miss   | 6.66   | 5.42    | 1.23x   |
| L 32 MiB  | hit    | 9.72   | 5.25    | 1.85x   |
| L 32 MiB  | insert | 5.19   | 5.38    | 0.96x   |

Insert is ~tied on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`, where the branchless form unlocks NEON autovec.

Tests: `test_check_matches_reference` diffs the autovec'd `Block::check` against an inline short-circuit reference across 10K random pairs on every target the crate is built for. All bloom_filter tests pass.

@jhorstmann

Each 256-bit Parquet block is exactly one AVX2 vector; the K=8 lane test collapses to one `vptest` (`_mm256_testc_si256`). This PR vectorises that loop without changing the algorithm, hash, salts, or wire format. Per @jhorstmann's review on apache#10011: `Block::check` is rewritten in the vectorizer-friendly branchless shape and LLVM autovectorizes it directly to whatever SIMD ISA is enabled at compile time: - aarch64 (Apple Silicon, Graviton 2/3/4, Ampere, Cobalt): NEON is mandatory baseline, so the default build autovectorizes to `vmulq + vshrq + vshlq + vbicq + vorrq + vmaxvq`. - x86_64 with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`): autovectorizes to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest`. - Default `cargo build` on x86_64 (baseline `x86-64`, SSE2 only): partial SSE2 autovec — `vpsllvd` doesn't exist pre-AVX2, so the per-lane variable shift in the mask compute partly scalarizes. - wasm32, RISC-V, 32-bit: whatever the toolchain's target features allow; falls back to scalar otherwise. Production deployments that care about x86 SBBF perf should set `RUSTFLAGS="-C target-cpu=x86-64-v3"` (or higher). A runtime AVX2-detect shim was prototyped but I prefer this simplification. Two preconditions for autovec: - `Block::check` is the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape); a short-circuiting `.all()` defeats vectorization. - `Block::mask` is `#[inline]` so it folds into the call site. `Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the 256-bit load/store hits one cache line. A/B vs scalar (short-circuit `Block::check`) through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. 
x86_64, Cascade Lake-class Xeon @ 2.8 GHz, built with `-C target-cpu=x86-64-v3`:

| Regime    | Path   | Scalar (ns/op) | Autovec (ns/op) | Speedup |
|-----------|--------|---------------:|----------------:|--------:|
| S 128 KiB | miss   | 13.02 | 4.96  | 2.62x |
| S 128 KiB | hit    | 13.47 | 4.95  | 2.72x |
| S 128 KiB | insert | 11.62 | 5.41  | 2.15x |
| M 2 MiB   | miss   | 18.88 | 7.47  | 2.53x |
| M 2 MiB   | hit    | 18.12 | 7.22  | 2.51x |
| M 2 MiB   | insert | 14.99 | 8.45  | 1.77x |
| L 32 MiB  | miss   | 27.56 | 11.07 | 2.49x |
| L 32 MiB  | hit    | 26.57 | 11.23 | 2.37x |
| L 32 MiB  | insert | 23.53 | 12.77 | 1.84x |

aarch64, Apple Silicon M1 (NEON via baseline autovec, default build):

| Regime    | Path   | Scalar (ns/op) | Autovec (ns/op) | Speedup |
|-----------|--------|---------------:|----------------:|--------:|
| S 128 KiB | miss   | 4.61 | 3.24 | 1.42x |
| S 128 KiB | hit    | 6.84 | 3.17 | 2.16x |
| S 128 KiB | insert | 3.25 | 3.19 | 1.02x |
| M 2 MiB   | miss   | 5.20 | 3.24 | 1.61x |
| M 2 MiB   | hit    | 7.16 | 3.26 | 2.20x |
| M 2 MiB   | insert | 3.34 | 3.31 | 1.01x |
| L 32 MiB  | miss   | 6.66 | 5.42 | 1.23x |
| L 32 MiB  | hit    | 9.72 | 5.25 | 1.85x |
| L 32 MiB  | insert | 5.19 | 5.38 | 0.96x |

Insert is ~tied on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`, where the branchless form unlocks NEON autovec.

Tests: `test_check_matches_reference` diffs the autovec'd `Block::check` against an inline short-circuit reference across 10K random pairs on every target the crate is built for. All bloom_filter tests pass.
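For readers who want to see the two `check` shapes side by side, here is a minimal standalone sketch. It is not the PR's code: the `Block` type below is a simplified stand-in, the method names `check_short_circuit` and `check_branchless` are hypothetical, and the salts are the eight constants from the Parquet bloom filter specification. Whether LLVM actually vectorizes the loop depends on the toolchain and target features, so treat this as an illustration of the shape rather than a performance claim.

```rust
/// Salt constants from the Parquet split-block bloom filter specification.
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

/// Simplified stand-in for the crate's `Block`: eight 32-bit lanes, 32-byte aligned.
#[derive(Clone, Copy)]
#[repr(C, align(32))]
pub struct Block(pub [u32; 8]);

// Size and alignment checked at compile time.
const _: () = assert!(std::mem::size_of::<Block>() == 32);
const _: () = assert!(std::mem::align_of::<Block>() == 32);

impl Block {
    /// One bit per lane, selected by the top 5 bits of `x * SALT[i]`.
    #[inline]
    pub fn mask(x: u32) -> Block {
        let mut m = [0u32; 8];
        for i in 0..8 {
            m[i] = 1u32 << (x.wrapping_mul(SALT[i]) >> 27);
        }
        Block(m)
    }

    /// Sets the mask bit in every lane.
    pub fn insert(&mut self, hash: u32) {
        let mask = Self::mask(hash);
        for i in 0..8 {
            self.0[i] |= mask.0[i];
        }
    }

    /// Reference shape: exits on the first lane whose bit is missing.
    pub fn check_short_circuit(&self, hash: u32) -> bool {
        let mask = Self::mask(hash);
        (0..8).all(|i| self.0[i] & mask.0[i] != 0)
    }

    /// Branchless "testc" shape: accumulate mask bits absent from the block,
    /// then compare once. No early exit, so every lane does the same work.
    pub fn check_branchless(&self, hash: u32) -> bool {
        let mask = Self::mask(hash);
        let mut acc = 0u32;
        for i in 0..8 {
            acc |= !self.0[i] & mask.0[i];
        }
        acc == 0
    }
}
```

The two forms agree because each mask lane has exactly one bit set: that bit is present in the block lane exactly when `!block & mask` is zero for that lane.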

dmatth1 · 2026-05-23T20:43:26Z

Tested locally on aarch64 too (Apple Silicon M1, baseline NEON autovec):

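| Regime    | Path   | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss   | 4.61 | 3.24 | 1.42x |
| S 128 KiB | hit    | 6.84 | 3.17 | 2.16x |
| S 128 KiB | insert | 3.25 | 3.19 | 1.02x |
| M 2 MiB   | miss   | 5.20 | 3.24 | 1.61x |
| M 2 MiB   | hit    | 7.16 | 3.26 | 2.20x |
| M 2 MiB   | insert | 3.34 | 3.31 | 1.01x |
| L 32 MiB  | miss   | 6.66 | 5.42 | 1.23x |
| L 32 MiB  | hit    | 9.72 | 5.25 | 1.85x |
| L 32 MiB  | insert | 5.19 | 5.38 | 0.96x |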

Big simplifier. I included details about how autovec reduces/lowers instructions in the new commit message. Going to force-push to use this approach.

One thing beyond your suggestion: I prototyped a runtime AVX2-detect shim and dropped it for the simplification (no unsafe, no Sbbf field, no hot-path branch) since users who care about AVX2 probably already set -C target-cpu=....
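Since the speedup now depends on compile-time target features rather than runtime dispatch, it can help to confirm how a binary was built. The sketch below is not part of the PR; `built_with_avx2` and `host_has_avx2` are hypothetical helper names, and it relies only on the standard `cfg!` and `is_x86_feature_detected!` macros.

```rust
/// True when this binary was compiled with AVX2 enabled on x86_64,
/// for example via `-C target-cpu=x86-64-v3`. Resolved at compile time.
pub const fn built_with_avx2() -> bool {
    cfg!(all(target_arch = "x86_64", target_feature = "avx2"))
}

/// Runtime probe of the host CPU, useful for diagnostics only.
#[cfg(target_arch = "x86_64")]
pub fn host_has_avx2() -> bool {
    std::is_x86_feature_detected!("avx2")
}

/// Non-x86_64 targets have no AVX2.
#[cfg(not(target_arch = "x86_64"))]
pub fn host_has_avx2() -> bool {
    false
}
```

A host that reports AVX2 while `built_with_avx2()` is false is the case where setting `RUSTFLAGS` would be expected to matter.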

github-actions Bot added the parquet Changes to the parquet crate label May 23, 2026

dmatth1 force-pushed the sbbf-simd branch from 5dd4690 to 4f1e5df Compare May 23, 2026 15:57

dmatth1 force-pushed the sbbf-simd branch from 4f1e5df to 2c47d62 Compare May 23, 2026 20:44

dmatth1 mentioned this pull request May 24, 2026

GH-50026: [C++][Parquet] SIMD-accelerate SBBF probe via branchless autovec apache/arrow#50030

Open

dmatth1 wants to merge 1 commit into apache:main from dmatth1:sbbf-simd
