parquet: SIMD-accelerate Sbbf probe (AVX2, scalar fallback)#10011
parquet: SIMD-accelerate Sbbf probe (AVX2, scalar fallback)#10011dmatth1 wants to merge 1 commit into
Conversation
|
It looks to me like with one small change to the |
… shim Alternative to the hand-written AVX2 intrinsics, per @jhorstmann's review on apache#10011: there are no `_mm256_*` intrinsics here. The single probe implementation lives in `Block::{check,insert}`, written in the vectorizer-friendly shape, and a thin `#[target_feature(enable = "avx2")]` shim (`simd_x86::sbbf_{check,insert}_hash`) calls it. Because the shim is compiled with AVX2 enabled, LLVM autovectorizes the plain Rust body to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest` — on a baseline `x86-64` build, with no downstream `target-cpu` flag. The shim is reached only after a runtime `is_x86_feature_detected!` check (cached on `Sbbf`); on the scalar fallback path the same source compiles to SSE2. Two details are load-bearing for the autovectorizer: - `Block::check` is the branchless integer OR-accumulator `acc |= !block[i] & mask[i]; acc == 0` (the "testc" reduction shape), not a short-circuiting `.all()`. The short-circuit form defeats vectorization; a bool-`&=` form fails to vectorize through the target_feature shim on a baseline build. - `Block::mask` is `#[inline]` so it folds into the shim and is vectorized with it rather than staying a scalar call. `Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the autovectorized 256-bit load/store hits one cache line. A/B vs the scalar fallback through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"` (removed before this commit). x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build` (no `target-cpu`): | Regime | Path | Scalar | Autovec (tf-shim) | Speedup | |-----------|--------|-------:|------------------:|--------:| | S 128 KiB | miss | 13.02 | 4.96 | 2.62x | | S 128 KiB | hit | 13.47 | 4.95 | 2.72x | | S 128 KiB | insert | 11.62 | 5.41 | 2.15x | | M 2 MiB | miss | 18.88 | 7.47 | 2.53x | | M 2 MiB | hit | 18.12 | 7.22 | 2.51x | | M 2 MiB | insert | 14.99 | 8.45 | 1.77x | | L 32 MiB | miss | 27.56 | 11.07 | 2.49x | | L 32 MiB | hit | 26.57 | 11.23 | 2.37x | | L 32 MiB | insert | 23.53 | 12.77 | 1.84x | Tests: the two `test_simd_*_matches_scalar` diff tests assert the AVX2-compiled shim and the baseline-compiled scalar path produce identical output across 10K random `(blocks, hash)` pairs each (guarding against an autovectorizer miscompile). All 35 bloom_filter tests pass with and without `-C target-cpu=native`.
… shim Alternative to the hand-written AVX2 intrinsics, per @jhorstmann's review on apache#10011: there are no `_mm256_*` intrinsics here. The single probe implementation lives in `Block::{check,insert}`, written in the vectorizer-friendly shape, and a thin `#[target_feature(enable = "avx2")]` shim (`avx2::{check,insert}_hash`) calls it. Because the shim is compiled with AVX2 enabled, LLVM autovectorizes the plain Rust body to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest` — on a baseline `x86-64` build, with no downstream `target-cpu` flag. The shim is reached only after a runtime `is_x86_feature_detected!` check (cached on `Sbbf`); on the scalar fallback path the same source compiles to SSE2. Two details are load-bearing for the autovectorizer: - `Block::check` is the branchless integer OR-accumulator `acc |= !block[i] & mask[i]; acc == 0` (the "testc" reduction shape), not a short-circuiting `.all()`. The short-circuit form defeats vectorization; a bool-`&=` form fails to vectorize through the target_feature shim on a baseline build. - `Block::mask` is `#[inline]` so it folds into the shim and is vectorized with it rather than staying a scalar call. `Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the autovectorized 256-bit load/store hits one cache line. A/B vs the scalar fallback through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"` (removed before this commit). x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build` (no `target-cpu`): | Regime | Path | Scalar | Autovec (avx2 shim) | Speedup | |-----------|--------|-------:|--------------------:|--------:| | S 128 KiB | miss | 13.02 | 4.96 | 2.62x | | S 128 KiB | hit | 13.47 | 4.95 | 2.72x | | S 128 KiB | insert | 11.62 | 5.41 | 2.15x | | M 2 MiB | miss | 18.88 | 7.47 | 2.53x | | M 2 MiB | hit | 18.12 | 7.22 | 2.51x | | M 2 MiB | insert | 14.99 | 8.45 | 1.77x | | L 32 MiB | miss | 27.56 | 11.07 | 2.49x | | L 32 MiB | hit | 26.57 | 11.23 | 2.37x | | L 32 MiB | insert | 23.53 | 12.77 | 1.84x | Tests: the two `test_simd_*_matches_scalar` diff tests assert the AVX2-compiled shim and the baseline-compiled scalar path produce identical output across 10K random `(blocks, hash)` pairs each (guarding against an autovectorizer miscompile). All 35 bloom_filter tests pass with and without `-C target-cpu=native`.
|
Great callout. Measured bench and the numbers with autovectorization are better: Same-host, same-session medians (Cascade Lake-class Xeon @ 2.8 GHz), via the public
Changes here: main...dmatth1:arrow-rs:sbbf-autovec-tf |
… shim Alternative to the hand-written AVX2 intrinsics, per @jhorstmann's review on apache#10011: there are no `_mm256_*` intrinsics here. The single probe implementation lives in `Block::{check,insert}`, written in the vectorizer-friendly shape, and a thin `#[target_feature(enable = "avx2")]` shim (`avx2::{check,insert}_hash`) calls it. Because the shim is compiled with AVX2 enabled, LLVM autovectorizes the plain Rust body to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest` — on a baseline `x86-64` build, with no downstream `target-cpu` flag. The shim is reached only after a runtime `is_x86_feature_detected!` check (cached on `Sbbf`); on the scalar fallback path the same source compiles to SSE2. Two details are load-bearing for the autovectorizer: - `Block::check` is the branchless integer OR-accumulator `acc |= !block[i] & mask[i]; acc == 0` (the "testc" reduction shape), not a short-circuiting `.all()`. The short-circuit form defeats vectorization; a bool-`&=` form fails to vectorize through the target_feature shim on a baseline build. - `Block::mask` is `#[inline]` so it folds into the shim and is vectorized with it rather than staying a scalar call. `Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the autovectorized 256-bit load/store hits one cache line. A/B vs the scalar fallback through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"` (removed before this commit). x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build` (no `target-cpu`): | Regime | Path | Scalar | Autovec (avx2 shim) | Speedup | |-----------|--------|-------:|--------------------:|--------:| | S 128 KiB | miss | 13.02 | 4.96 | 2.62x | | S 128 KiB | hit | 13.47 | 4.95 | 2.72x | | S 128 KiB | insert | 11.62 | 5.41 | 2.15x | | M 2 MiB | miss | 18.88 | 7.47 | 2.53x | | M 2 MiB | hit | 18.12 | 7.22 | 2.51x | | M 2 MiB | insert | 14.99 | 8.45 | 1.77x | | L 32 MiB | miss | 27.56 | 11.07 | 2.49x | | L 32 MiB | hit | 26.57 | 11.23 | 2.37x | | L 32 MiB | insert | 23.53 | 12.77 | 1.84x | Tests: the two `test_simd_*_matches_scalar` diff tests assert the AVX2-compiled shim and the baseline-compiled scalar path produce identical output across 10K random `(blocks, hash)` pairs each (guarding against an autovectorizer miscompile). All 35 bloom_filter tests pass with and without `-C target-cpu=native`.
… shim Alternative to the hand-written AVX2 intrinsics, per @jhorstmann's review on apache#10011: there are no `_mm256_*` intrinsics here. The single probe implementation lives in `Block::{check,insert}`, written in the vectorizer-friendly shape, and a thin `#[target_feature(enable = "avx2")]` shim (`avx2::{check,insert}_hash`) calls it. Because the shim is compiled with AVX2 enabled, LLVM autovectorizes the plain Rust body to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest` — on a baseline `x86-64` build, with no downstream `target-cpu` flag. The shim is reached only after a runtime `is_x86_feature_detected!` check (cached on `Sbbf`); on the scalar fallback path the same source compiles to SSE2. Two details are load-bearing for the autovectorizer: - `Block::check` is the branchless integer OR-accumulator `acc |= !block[i] & mask[i]; acc == 0` (the "testc" reduction shape), not a short-circuiting `.all()`. The short-circuit form defeats vectorization; a bool-`&=` form fails to vectorize through the target_feature shim on a baseline build. - `Block::mask` is `#[inline]` so it folds into the shim and is vectorized with it rather than staying a scalar call. `Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the autovectorized 256-bit load/store hits one cache line. A/B vs the scalar fallback through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"` (removed before this commit). x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build` (no `target-cpu`): | Regime | Path | Scalar | Autovec (avx2 shim) | Speedup | |-----------|--------|-------:|--------------------:|--------:| | S 128 KiB | miss | 13.02 | 4.96 | 2.62x | | S 128 KiB | hit | 13.47 | 4.95 | 2.72x | | S 128 KiB | insert | 11.62 | 5.41 | 2.15x | | M 2 MiB | miss | 18.88 | 7.47 | 2.53x | | M 2 MiB | hit | 18.12 | 7.22 | 2.51x | | M 2 MiB | insert | 14.99 | 8.45 | 1.77x | | L 32 MiB | miss | 27.56 | 11.07 | 2.49x | | L 32 MiB | hit | 26.57 | 11.23 | 2.37x | | L 32 MiB | insert | 23.53 | 12.77 | 1.84x | Tests: the two `test_simd_*_matches_scalar` diff tests assert the AVX2-compiled shim and the baseline-compiled scalar path produce identical output across 10K random `(blocks, hash)` pairs each (guarding against an autovectorizer miscompile). All 35 bloom_filter tests pass with and without `-C target-cpu=native`.
… shim Per @jhorstmann's review on apache#10011: no `_mm256_*` intrinsics. The single probe implementation lives in `Block::{check,insert}` and a thin `#[target_feature(enable = "avx2")]` shim calls into it. Because the shim is compiled with AVX2 on, LLVM autovectorizes the plain Rust body to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest` — on a baseline `x86-64` build, no `target-cpu` flag required. The shim is reached after a runtime `is_x86_feature_detected!` check (cached on `Sbbf`). Two preconditions for autovec: - `Block::check` is the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape); a short-circuiting `.all()` defeats vectorization. - `Block::mask` is `#[inline]` so it folds into the shim. `Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the 256-bit load/store hits one cache line. The same branchless `Block::check` also autovectorizes to NEON on aarch64 — no shim, no `target_feature` needed (NEON is baseline). On main, the short-circuit form left aarch64 fully scalar. A/B vs the scalar fallback through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"` (removed before this commit). x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build`: | Regime | Path | Scalar | Autovec (avx2 shim) | Speedup | |-----------|--------|-------:|--------------------:|--------:| | S 128 KiB | miss | 13.02 | 4.96 | 2.62x | | S 128 KiB | hit | 13.47 | 4.95 | 2.72x | | S 128 KiB | insert | 11.62 | 5.41 | 2.15x | | M 2 MiB | miss | 18.88 | 7.47 | 2.53x | | M 2 MiB | hit | 18.12 | 7.22 | 2.51x | | M 2 MiB | insert | 14.99 | 8.45 | 1.77x | | L 32 MiB | miss | 27.56 | 11.07 | 2.49x | | L 32 MiB | hit | 26.57 | 11.23 | 2.37x | | L 32 MiB | insert | 23.53 | 12.77 | 1.84x | aarch64 — Apple Silicon M1: | Regime | Path | Scalar | Autovec (NEON) | Speedup | |-----------|--------|-------:|---------------:|--------:| | S 128 KiB | miss | 4.61 | 3.24 | 1.42x | | S 128 KiB | hit | 6.84 | 3.17 | 2.16x | | S 128 KiB | insert | 3.25 | 3.19 | 1.02x | | M 2 MiB | miss | 5.20 | 3.24 | 1.61x | | M 2 MiB | hit | 7.16 | 3.26 | 2.20x | | M 2 MiB | insert | 3.34 | 3.31 | 1.01x | | L 32 MiB | miss | 6.66 | 5.42 | 1.23x | | L 32 MiB | hit | 9.72 | 5.25 | 1.85x | | L 32 MiB | insert | 5.19 | 5.38 | 0.96x | Insert is ~tied on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`, where the branchless form unlocks NEON autovec. Tests: `test_simd_{check,insert}_matches_scalar` diff the AVX2 shim against the baseline-compiled scalar across 10K random pairs; `test_check_matches_reference_aarch64` diffs the autovec'd check against an inline short-circuit reference for the aarch64 codegen path. All bloom_filter tests pass with and without `-C target-cpu=native`.
Per @jhorstmann's review on apache#10011: no hand-written `_mm256_*` / NEON intrinsics, no runtime dispatch, no `target_feature` shim. `Block::check` is rewritten in the vectorizer-friendly branchless shape and LLVM autovectorizes it directly to whatever SIMD ISA is enabled at compile time: - aarch64 (Apple Silicon, Graviton 2/3/4, Ampere, Cobalt): NEON is mandatory baseline, so the default build autovectorizes to `vmulq + vshrq + vshlq + vbicq + vorrq + vmaxvq`. - x86_64 with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`): autovectorizes to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest`. - Default `cargo build` on x86_64 (baseline `x86-64`, SSE2 only): partial SSE2 autovec — `vpsllvd` doesn't exist pre-AVX2, so the per-lane variable shift in the mask compute partly scalarizes. - wasm32, RISC-V, 32-bit: whatever the toolchain's target features allow; falls back to scalar otherwise. Production deployments that care about x86 SBBF perf should set `RUSTFLAGS="-C target-cpu=x86-64-v3"` (or higher). This is already the convention for analytical Rust binaries (Polars, DataFusion, Databend distros). A runtime AVX2-detect shim was prototyped and rejected for this PR — it adds `unsafe`, a per-`Sbbf` cached bool, and a dispatch branch in the hot path, in exchange for AVX2 codegen on default-built binaries running on AVX2 hardware. The simplification was preferred. Two preconditions for autovec: - `Block::check` is the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape); a short-circuiting `.all()` defeats vectorization. - `Block::mask` is `#[inline]` so it folds into the call site. `Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the 256-bit load/store hits one cache line. A/B vs scalar (short-circuit `Block::check`) through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, built with `-C target-cpu=x86-64-v3`: | Regime | Path | Scalar | Autovec | Speedup | |-----------|--------|-------:|--------:|--------:| | S 128 KiB | miss | 13.02 | 4.96 | 2.62x | | S 128 KiB | hit | 13.47 | 4.95 | 2.72x | | S 128 KiB | insert | 11.62 | 5.41 | 2.15x | | M 2 MiB | miss | 18.88 | 7.47 | 2.53x | | M 2 MiB | hit | 18.12 | 7.22 | 2.51x | | M 2 MiB | insert | 14.99 | 8.45 | 1.77x | | L 32 MiB | miss | 27.56 | 11.07 | 2.49x | | L 32 MiB | hit | 26.57 | 11.23 | 2.37x | | L 32 MiB | insert | 23.53 | 12.77 | 1.84x | aarch64 — Apple Silicon M1 (NEON via baseline autovec, default build): | Regime | Path | Scalar | Autovec | Speedup | |-----------|--------|-------:|--------:|--------:| | S 128 KiB | miss | 4.61 | 3.24 | 1.42x | | S 128 KiB | hit | 6.84 | 3.17 | 2.16x | | S 128 KiB | insert | 3.25 | 3.19 | 1.02x | | M 2 MiB | miss | 5.20 | 3.24 | 1.61x | | M 2 MiB | hit | 7.16 | 3.26 | 2.20x | | M 2 MiB | insert | 3.34 | 3.31 | 1.01x | | L 32 MiB | miss | 6.66 | 5.42 | 1.23x | | L 32 MiB | hit | 9.72 | 5.25 | 1.85x | | L 32 MiB | insert | 5.19 | 5.38 | 0.96x | Insert is ~tied on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`, where the branchless form unlocks NEON autovec. Tests: `test_check_matches_reference` diffs the autovec'd `Block::check` against an inline short-circuit reference across 10K random pairs on every target the crate is built for. All bloom_filter tests pass.
Per @jhorstmann's review on apache#10011: no hand-written `_mm256_*` / NEON intrinsics, no runtime dispatch, no `target_feature` shim. `Block::check` is rewritten in the vectorizer-friendly branchless shape and LLVM autovectorizes it directly to whatever SIMD ISA is enabled at compile time: - aarch64 (Apple Silicon, Graviton 2/3/4, Ampere, Cobalt): NEON is mandatory baseline, so the default build autovectorizes to `vmulq + vshrq + vshlq + vbicq + vorrq + vmaxvq`. - x86_64 with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`): autovectorizes to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest`. - Default `cargo build` on x86_64 (baseline `x86-64`, SSE2 only): partial SSE2 autovec — `vpsllvd` doesn't exist pre-AVX2, so the per-lane variable shift in the mask compute partly scalarizes. - wasm32, RISC-V, 32-bit: whatever the toolchain's target features allow; falls back to scalar otherwise. Production deployments that care about x86 SBBF perf should set `RUSTFLAGS="-C target-cpu=x86-64-v3"` (or higher). This is already the convention for analytical Rust binaries (Polars, DataFusion, Databend distros). A runtime AVX2-detect shim was prototyped and rejected for this PR — it adds `unsafe`, a per-`Sbbf` cached bool, and a dispatch branch in the hot path, in exchange for AVX2 codegen on default-built binaries running on AVX2 hardware. The simplification was preferred. Two preconditions for autovec: - `Block::check` is the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape); a short-circuiting `.all()` defeats vectorization. - `Block::mask` is `#[inline]` so it folds into the call site. `Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the 256-bit load/store hits one cache line. A/B vs scalar (short-circuit `Block::check`) through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, built with `-C target-cpu=x86-64-v3`: | Regime | Path | Scalar | Autovec | Speedup | |-----------|--------|-------:|--------:|--------:| | S 128 KiB | miss | 13.02 | 4.96 | 2.62x | | S 128 KiB | hit | 13.47 | 4.95 | 2.72x | | S 128 KiB | insert | 11.62 | 5.41 | 2.15x | | M 2 MiB | miss | 18.88 | 7.47 | 2.53x | | M 2 MiB | hit | 18.12 | 7.22 | 2.51x | | M 2 MiB | insert | 14.99 | 8.45 | 1.77x | | L 32 MiB | miss | 27.56 | 11.07 | 2.49x | | L 32 MiB | hit | 26.57 | 11.23 | 2.37x | | L 32 MiB | insert | 23.53 | 12.77 | 1.84x | aarch64 — Apple Silicon M1 (NEON via baseline autovec, default build): | Regime | Path | Scalar | Autovec | Speedup | |-----------|--------|-------:|--------:|--------:| | S 128 KiB | miss | 4.61 | 3.24 | 1.42x | | S 128 KiB | hit | 6.84 | 3.17 | 2.16x | | S 128 KiB | insert | 3.25 | 3.19 | 1.02x | | M 2 MiB | miss | 5.20 | 3.24 | 1.61x | | M 2 MiB | hit | 7.16 | 3.26 | 2.20x | | M 2 MiB | insert | 3.34 | 3.31 | 1.01x | | L 32 MiB | miss | 6.66 | 5.42 | 1.23x | | L 32 MiB | hit | 9.72 | 5.25 | 1.85x | | L 32 MiB | insert | 5.19 | 5.38 | 0.96x | Insert is ~tied on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`, where the branchless form unlocks NEON autovec. Tests: `test_check_matches_reference` diffs the autovec'd `Block::check` against an inline short-circuit reference across 10K random pairs on every target the crate is built for. All bloom_filter tests pass.
Per @jhorstmann's review on apache#10011: no hand-written intrinsics, no target_feature shim, no runtime dispatch. `Block::check` is rewritten as the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape) and LLVM autovectorizes it directly to NEON on aarch64 and to AVX2 on x86_64 built with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`). A runtime AVX2-detect shim was prototyped and rejected: the simplification (no `unsafe`, no `Sbbf` field, no hot-path branch) beat the only thing it bought, which was AVX2 codegen for default- built binaries on AVX2 hardware — production deployments that care already set the target-cpu flag. Preconditions: `Block::mask` is `#[inline]` (folds into the call site) and `Block` is `#[repr(C, align(32))]` with size/align asserted (so the 256-bit load/store hits one cache line). A/B vs scalar (short-circuit `Block::check`) through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, `-C target-cpu=x86-64-v3`: | Regime | Path | Scalar | Autovec | Speedup | |-----------|--------|-------:|--------:|--------:| | S 128 KiB | miss | 13.02 | 4.96 | 2.62x | | S 128 KiB | hit | 13.47 | 4.95 | 2.72x | | S 128 KiB | insert | 11.62 | 5.41 | 2.15x | | M 2 MiB | miss | 18.88 | 7.47 | 2.53x | | M 2 MiB | hit | 18.12 | 7.22 | 2.51x | | M 2 MiB | insert | 14.99 | 8.45 | 1.77x | | L 32 MiB | miss | 27.56 | 11.07 | 2.49x | | L 32 MiB | hit | 26.57 | 11.23 | 2.37x | | L 32 MiB | insert | 23.53 | 12.77 | 1.84x | aarch64 — Apple Silicon M1 (NEON via baseline autovec): | Regime | Path | Scalar | Autovec | Speedup | |-----------|--------|-------:|--------:|--------:| | S 128 KiB | miss | 4.61 | 3.24 | 1.42x | | S 128 KiB | hit | 6.84 | 3.17 | 2.16x | | S 128 KiB | insert | 3.25 | 3.19 | 1.02x | | M 2 MiB | miss | 5.20 | 3.24 | 1.61x | | M 2 MiB | hit | 7.16 | 3.26 | 2.20x | | M 2 MiB | insert | 3.34 | 3.31 | 1.01x | | L 32 MiB | miss | 6.66 | 5.42 | 1.23x | | L 32 MiB | hit | 9.72 | 5.25 | 1.85x | | L 32 MiB | insert | 5.19 | 5.38 | 0.96x | Insert ties on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`. Tests: `test_check_matches_reference` diffs the autovec'd `Block::check` against an inline short-circuit reference across 10K random pairs, on every target. All bloom_filter tests pass.
`Sbbf::{check,insert}` are on the hot path of Parquet row-group
skipping for every reader downstream of `arrow-rs` (DataFusion,
Databend, InfluxDB / IOx, RisingWave, GreptimeDB). Each 256-bit
Parquet block is exactly one AVX2 vector / two NEON `uint32x4_t`
halves; the K=8 lane test is a one-instruction `vptest` on AVX2 and
an equivalent SIMD reduce on NEON. This PR vectorises the probe
without changing the algorithm, hash, salts, or wire format.
Per @jhorstmann's review on apache#10011: no hand-written intrinsics, no
target_feature shim, no runtime dispatch. `Block::check` is rewritten
as the branchless `acc |= !block & mask; acc == 0` ("testc" reduction
shape) and LLVM autovectorizes it directly to NEON on aarch64 and to
AVX2 on x86_64 built with `-C target-cpu=x86-64-v3` (or `=native`,
or `+avx2`). A runtime AVX2-detect shim was prototyped and rejected:
the simplification (no `unsafe`, no `Sbbf` field, no hot-path branch)
beat the only thing it bought, which was AVX2 codegen for default-
built binaries on AVX2 hardware — production deployments that care
already set the target-cpu flag.
Preconditions: `Block::mask` is `#[inline]` (folds into the call
site) and `Block` is `#[repr(C, align(32))]` with size/align
asserted (so the 256-bit load/store hits one cache line).
A/B vs scalar (short-circuit `Block::check`) through the public
`Sbbf::{check,insert}` API (XXH64 + probe), criterion default
profile, same-session medians, ns/op.
x86_64 — Cascade Lake-class Xeon @ 2.8 GHz,
`-C target-cpu=x86-64-v3`:
| Regime | Path | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss | 13.02 | 4.96 | 2.62x |
| S 128 KiB | hit | 13.47 | 4.95 | 2.72x |
| S 128 KiB | insert | 11.62 | 5.41 | 2.15x |
| M 2 MiB | miss | 18.88 | 7.47 | 2.53x |
| M 2 MiB | hit | 18.12 | 7.22 | 2.51x |
| M 2 MiB | insert | 14.99 | 8.45 | 1.77x |
| L 32 MiB | miss | 27.56 | 11.07 | 2.49x |
| L 32 MiB | hit | 26.57 | 11.23 | 2.37x |
| L 32 MiB | insert | 23.53 | 12.77 | 1.84x |
aarch64 — Apple Silicon M1 (NEON via baseline autovec):
| Regime | Path | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss | 4.61 | 3.24 | 1.42x |
| S 128 KiB | hit | 6.84 | 3.17 | 2.16x |
| S 128 KiB | insert | 3.25 | 3.19 | 1.02x |
| M 2 MiB | miss | 5.20 | 3.24 | 1.61x |
| M 2 MiB | hit | 7.16 | 3.26 | 2.20x |
| M 2 MiB | insert | 3.34 | 3.31 | 1.01x |
| L 32 MiB | miss | 6.66 | 5.42 | 1.23x |
| L 32 MiB | hit | 9.72 | 5.25 | 1.85x |
| L 32 MiB | insert | 5.19 | 5.38 | 0.96x |
Insert ties on aarch64 because main's `Block::insert` was already
vectorizer-friendly. The PR's aarch64 win lives in `check`.
Tests: `test_check_matches_reference` diffs the autovec'd
`Block::check` against an inline short-circuit reference across 10K
random pairs, on every target. All bloom_filter tests pass.
Each 256-bit Parquet block is exactly one AVX2 vector; the K=8 lane test collapses to one `vptest` (`_mm256_testc_si256`). This PR vectorises that loop without changing the algorithm, hash, salts, or wire format. Per @jhorstmann's review on apache#10011: `Block::check` is rewritten in the vectorizer-friendly branchless shape and LLVM autovectorizes it directly to whatever SIMD ISA is enabled at compile time: - aarch64 (Apple Silicon, Graviton 2/3/4, Ampere, Cobalt): NEON is mandatory baseline, so the default build autovectorizes to `vmulq + vshrq + vshlq + vbicq + vorrq + vmaxvq`. - x86_64 with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`): autovectorizes to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest`. - Default `cargo build` on x86_64 (baseline `x86-64`, SSE2 only): partial SSE2 autovec — `vpsllvd` doesn't exist pre-AVX2, so the per-lane variable shift in the mask compute partly scalarizes. - wasm32, RISC-V, 32-bit: whatever the toolchain's target features allow; falls back to scalar otherwise. Production deployments that care about x86 SBBF perf should set `RUSTFLAGS="-C target-cpu=x86-64-v3"` (or higher). A runtime AVX2-detect shim was prototyped but I prefer this simplification. Two preconditions for autovec: - `Block::check` is the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape); a short-circuiting `.all()` defeats vectorization. - `Block::mask` is `#[inline]` so it folds into the call site. `Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the 256-bit load/store hits one cache line. A/B vs scalar (short-circuit `Block::check`) through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, built with `-C target-cpu=x86-64-v3`: | Regime | Path | Scalar | Autovec | Speedup | |-----------|--------|-------:|--------:|--------:| | S 128 KiB | miss | 13.02 | 4.96 | 2.62x | | S 128 KiB | hit | 13.47 | 4.95 | 2.72x | | S 128 KiB | insert | 11.62 | 5.41 | 2.15x | | M 2 MiB | miss | 18.88 | 7.47 | 2.53x | | M 2 MiB | hit | 18.12 | 7.22 | 2.51x | | M 2 MiB | insert | 14.99 | 8.45 | 1.77x | | L 32 MiB | miss | 27.56 | 11.07 | 2.49x | | L 32 MiB | hit | 26.57 | 11.23 | 2.37x | | L 32 MiB | insert | 23.53 | 12.77 | 1.84x | aarch64 — Apple Silicon M1 (NEON via baseline autovec, default build): | Regime | Path | Scalar | Autovec | Speedup | |-----------|--------|-------:|--------:|--------:| | S 128 KiB | miss | 4.61 | 3.24 | 1.42x | | S 128 KiB | hit | 6.84 | 3.17 | 2.16x | | S 128 KiB | insert | 3.25 | 3.19 | 1.02x | | M 2 MiB | miss | 5.20 | 3.24 | 1.61x | | M 2 MiB | hit | 7.16 | 3.26 | 2.20x | | M 2 MiB | insert | 3.34 | 3.31 | 1.01x | | L 32 MiB | miss | 6.66 | 5.42 | 1.23x | | L 32 MiB | hit | 9.72 | 5.25 | 1.85x | | L 32 MiB | insert | 5.19 | 5.38 | 0.96x | Insert is ~tied on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`, where the branchless form unlocks NEON autovec. Tests: `test_check_matches_reference` diffs the autovec'd `Block::check` against an inline short-circuit reference across 10K random pairs on every target the crate is built for. All bloom_filter tests pass.
Each 256-bit Parquet block is exactly one AVX2 vector; the K=8 lane test collapses to one `vptest` (`_mm256_testc_si256`). This PR vectorises that loop without changing the algorithm, hash, salts, or wire format. Per @jhorstmann's review on apache#10011: `Block::check` is rewritten in the vectorizer-friendly branchless shape and LLVM autovectorizes it directly to whatever SIMD ISA is enabled at compile time: - aarch64 (Apple Silicon, Graviton 2/3/4, Ampere, Cobalt): NEON is mandatory baseline, so the default build autovectorizes to `vmulq + vshrq + vshlq + vbicq + vorrq + vmaxvq`. - x86_64 with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`): autovectorizes to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest`. - Default `cargo build` on x86_64 (baseline `x86-64`, SSE2 only): partial SSE2 autovec — `vpsllvd` doesn't exist pre-AVX2, so the per-lane variable shift in the mask compute partly scalarizes. - wasm32, RISC-V, 32-bit: whatever the toolchain's target features allow; falls back to scalar otherwise. Production deployments that care about x86 SBBF perf should set `RUSTFLAGS="-C target-cpu=x86-64-v3"` (or higher). A runtime AVX2-detect shim was prototyped but I prefer this simplification. Two preconditions for autovec: - `Block::check` is the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape); a short-circuiting `.all()` defeats vectorization. - `Block::mask` is `#[inline]` so it folds into the call site. `Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the 256-bit load/store hits one cache line. A/B vs scalar (short-circuit `Block::check`) through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, built with `-C target-cpu=x86-64-v3`: | Regime | Path | Scalar | Autovec | Speedup | |-----------|--------|-------:|--------:|--------:| | S 128 KiB | miss | 13.02 | 4.96 | 2.62x | | S 128 KiB | hit | 13.47 | 4.95 | 2.72x | | S 128 KiB | insert | 11.62 | 5.41 | 2.15x | | M 2 MiB | miss | 18.88 | 7.47 | 2.53x | | M 2 MiB | hit | 18.12 | 7.22 | 2.51x | | M 2 MiB | insert | 14.99 | 8.45 | 1.77x | | L 32 MiB | miss | 27.56 | 11.07 | 2.49x | | L 32 MiB | hit | 26.57 | 11.23 | 2.37x | | L 32 MiB | insert | 23.53 | 12.77 | 1.84x | aarch64 — Apple Silicon M1 (NEON via baseline autovec, default build): | Regime | Path | Scalar | Autovec | Speedup | |-----------|--------|-------:|--------:|--------:| | S 128 KiB | miss | 4.61 | 3.24 | 1.42x | | S 128 KiB | hit | 6.84 | 3.17 | 2.16x | | S 128 KiB | insert | 3.25 | 3.19 | 1.02x | | M 2 MiB | miss | 5.20 | 3.24 | 1.61x | | M 2 MiB | hit | 7.16 | 3.26 | 2.20x | | M 2 MiB | insert | 3.34 | 3.31 | 1.01x | | L 32 MiB | miss | 6.66 | 5.42 | 1.23x | | L 32 MiB | hit | 9.72 | 5.25 | 1.85x | | L 32 MiB | insert | 5.19 | 5.38 | 0.96x | Insert is ~tied on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`, where the branchless form unlocks NEON autovec. Tests: `test_check_matches_reference` diffs the autovec'd `Block::check` against an inline short-circuit reference across 10K random pairs on every target the crate is built for. All bloom_filter tests pass.
|
Tested locally on aarch64 too (Apple Silicon M1, baseline NEON autovec):
Big simplifier. I included details about how autovec reduces/lowers instructions in the new commit message. Going to force-push to use this approach. One thing beyond your suggestion: I prototyped a runtime AVX2-detect shim and dropped it for the simplification (no |
Which issue does this PR close?
No tracked issue — opening directly, following the precedent of apache/arrow-go#336 which shipped AVX2/SSE4/NEON SBBF probes in 18.3.0, and paralleling an in-progress
[DISCUSS] thread on
dev@arrow.apache.orgfor the C++ port of the same kernel.Rationale for this change
Sbbf::check/Sbbf::insertare on the hot path of Parquet row-group skipping for every reader downstream ofarrow-rs(DataFusion, Databend, InfluxDB / IOx, RisingWave, GreptimeDB). Each 256-bit Parquet block is exactly one AVX2 vector;the K=8 lane test collapses to one
vptest(_mm256_testc_si256). This PR vectorises that loop on x86_64 without changing the algorithm, hash, salts, or wire format. NEON / aarch64 SIMD support is slated for a follow-up PR.What changes are included in this PR?
simd_x86, dispatched via cachedis_x86_feature_detected!("avx2")(dead-coded when-C target-cpu=native).Block::{check,insert}retained as the production fallback for non-AVX2 x86 / aarch64 / wasm32 / RISC-V / 32-bit / big-endian, and as the correctness reference the AVX2 kernel is diff-tested against.Blockchanged from#[repr(transparent)]to#[repr(C, align(32))]. Byte layout unchanged; alignment is asserted at compile time so the AVX2 aligned load/store contract is load-bearing.parquet/benches/bloom_filter.rsgainsbench_check(miss/hit × three cache regimes) andbench_insertexercising the public API.Are these changes tested?
Yes. The 31 pre-existing
bloom_filterunit tests continue to pass on x86_64 with and without-C target-cpu=native. Two new diff tests —test_simd_{check,insert}_matches_scalar— assert bit-identical AVX2-vs-scalar output across 10Krandom
(block, hash)pairs each. Benchmark results (Cascade Lake-class Xeon) are in the commit message.Are there any user-facing changes?
No. Public API, MSRV, dependencies, and wire format are all unchanged. The only observable effect is faster
Sbbf::check/Sbbf::inserton x86_64 hosts with AVX2.The SIMD kernel was drafted with AI assistance and reviewed line-by-line; correctness is enforced in CI by the diff tests above.
cargo fmt --all -- --checkandcargo clippy -p parquet --all-targets -- -D warningsboth clean on this branch.