Skip to content

parquet: SIMD-accelerate Sbbf probe (AVX2, scalar fallback)#10011

Open
dmatth1 wants to merge 1 commit into
apache:mainfrom
dmatth1:sbbf-simd
Open

parquet: SIMD-accelerate Sbbf probe (AVX2, scalar fallback)#10011
dmatth1 wants to merge 1 commit into
apache:mainfrom
dmatth1:sbbf-simd

Conversation

@dmatth1
Copy link
Copy Markdown

@dmatth1 dmatth1 commented May 23, 2026

Which issue does this PR close?

No tracked issue — opening directly, following the precedent of apache/arrow-go#336 which shipped AVX2/SSE4/NEON SBBF probes in 18.3.0, and paralleling an in-progress
[DISCUSS] thread on dev@arrow.apache.org for the C++ port of the same kernel.

Rationale for this change

Sbbf::check / Sbbf::insert are on the hot path of Parquet row-group skipping for every reader downstream of arrow-rs (DataFusion, Databend, InfluxDB / IOx, RisingWave, GreptimeDB). Each 256-bit Parquet block is exactly one AVX2 vector;
the K=8 lane test collapses to one vptest (_mm256_testc_si256). This PR vectorises that loop on x86_64 without changing the algorithm, hash, salts, or wire format. NEON / aarch64 SIMD support is slated for a follow-up PR.

What changes are included in this PR?

  • AVX2 kernel in simd_x86, dispatched via cached is_x86_feature_detected!("avx2") (dead-coded when -C target-cpu=native).
  • Scalar Block::{check,insert} retained as the production fallback for non-AVX2 x86 / aarch64 / wasm32 / RISC-V / 32-bit / big-endian, and as the correctness reference the AVX2 kernel is diff-tested against.
  • Block changed from #[repr(transparent)] to #[repr(C, align(32))]. Byte layout unchanged; alignment is asserted at compile time so the AVX2 aligned load/store contract is load-bearing.
  • parquet/benches/bloom_filter.rs gains bench_check (miss/hit × three cache regimes) and bench_insert exercising the public API.

Are these changes tested?

Yes. The 31 pre-existing bloom_filter unit tests continue to pass on x86_64 with and without -C target-cpu=native. Two new diff tests — test_simd_{check,insert}_matches_scalar — assert bit-identical AVX2-vs-scalar output across 10K
random (block, hash) pairs each. Benchmark results (Cascade Lake-class Xeon) are in the commit message.

Are there any user-facing changes?

No. Public API, MSRV, dependencies, and wire format are all unchanged. The only observable effect is faster Sbbf::check / Sbbf::insert on x86_64 hosts with AVX2.


The SIMD kernel was drafted with AI assistance and reviewed line-by-line; correctness is enforced in CI by the diff tests above. cargo fmt --all -- --check and cargo clippy -p parquet --all-targets -- -D warnings both clean on this branch.

@github-actions github-actions Bot added the parquet Changes to the parquet crate label May 23, 2026
@jhorstmann
Copy link
Copy Markdown
Contributor

It looks to me like with one small change to the check function, autovectorization can already generate the same instruction sequences: https://rust.godbolt.org/z/a3zT5Gezr

dmatth1 added a commit to dmatth1/arrow-rs that referenced this pull request May 23, 2026
… shim

Alternative to the hand-written AVX2 intrinsics, per @jhorstmann's
review on apache#10011: there are no `_mm256_*` intrinsics here. The single
probe implementation lives in `Block::{check,insert}`, written in the
vectorizer-friendly shape, and a thin `#[target_feature(enable =
"avx2")]` shim (`simd_x86::sbbf_{check,insert}_hash`) calls it. Because
the shim is compiled with AVX2 enabled, LLVM autovectorizes the plain
Rust body to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest` —
on a baseline `x86-64` build, with no downstream `target-cpu` flag.
The shim is reached only after a runtime `is_x86_feature_detected!`
check (cached on `Sbbf`); on the scalar fallback path the same source
compiles to SSE2.

Two details are load-bearing for the autovectorizer:
- `Block::check` is the branchless integer OR-accumulator
  `acc |= !block[i] & mask[i]; acc == 0` (the "testc" reduction
  shape), not a short-circuiting `.all()`. The short-circuit form
  defeats vectorization; a bool-`&=` form fails to vectorize through
  the target_feature shim on a baseline build.
- `Block::mask` is `#[inline]` so it folds into the shim and is
  vectorized with it rather than staying a scalar call.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module
scope) so the autovectorized 256-bit load/store hits one cache line.

A/B vs the scalar fallback through the public `Sbbf::{check,insert}`
API (XXH64 + probe), criterion default profile, same-session medians,
ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"`
(removed before this commit).

  x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build`
  (no `target-cpu`):

  | Regime    | Path   | Scalar | Autovec (tf-shim) | Speedup |
  |-----------|--------|-------:|------------------:|--------:|
  | S 128 KiB | miss   |  13.02 |              4.96 | 2.62x   |
  | S 128 KiB | hit    |  13.47 |              4.95 | 2.72x   |
  | S 128 KiB | insert |  11.62 |              5.41 | 2.15x   |
  | M 2 MiB   | miss   |  18.88 |              7.47 | 2.53x   |
  | M 2 MiB   | hit    |  18.12 |              7.22 | 2.51x   |
  | M 2 MiB   | insert |  14.99 |              8.45 | 1.77x   |
  | L 32 MiB  | miss   |  27.56 |             11.07 | 2.49x   |
  | L 32 MiB  | hit    |  26.57 |             11.23 | 2.37x   |
  | L 32 MiB  | insert |  23.53 |             12.77 | 1.84x   |

Tests: the two `test_simd_*_matches_scalar` diff tests assert the
AVX2-compiled shim and the baseline-compiled scalar path produce
identical output across 10K random `(blocks, hash)` pairs each
(guarding against an autovectorizer miscompile). All 35 bloom_filter
tests pass with and without `-C target-cpu=native`.
dmatth1 added a commit to dmatth1/arrow-rs that referenced this pull request May 23, 2026
… shim

Alternative to the hand-written AVX2 intrinsics, per @jhorstmann's
review on apache#10011: there are no `_mm256_*` intrinsics here. The single
probe implementation lives in `Block::{check,insert}`, written in the
vectorizer-friendly shape, and a thin `#[target_feature(enable =
"avx2")]` shim (`avx2::{check,insert}_hash`) calls it. Because the
shim is compiled with AVX2 enabled, LLVM autovectorizes the plain
Rust body to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest` —
on a baseline `x86-64` build, with no downstream `target-cpu` flag.
The shim is reached only after a runtime `is_x86_feature_detected!`
check (cached on `Sbbf`); on the scalar fallback path the same source
compiles to SSE2.

Two details are load-bearing for the autovectorizer:
- `Block::check` is the branchless integer OR-accumulator
  `acc |= !block[i] & mask[i]; acc == 0` (the "testc" reduction
  shape), not a short-circuiting `.all()`. The short-circuit form
  defeats vectorization; a bool-`&=` form fails to vectorize through
  the target_feature shim on a baseline build.
- `Block::mask` is `#[inline]` so it folds into the shim and is
  vectorized with it rather than staying a scalar call.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module
scope) so the autovectorized 256-bit load/store hits one cache line.

A/B vs the scalar fallback through the public `Sbbf::{check,insert}`
API (XXH64 + probe), criterion default profile, same-session medians,
ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"`
(removed before this commit).

  x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build`
  (no `target-cpu`):

  | Regime    | Path   | Scalar | Autovec (avx2 shim) | Speedup |
  |-----------|--------|-------:|--------------------:|--------:|
  | S 128 KiB | miss   |  13.02 |                4.96 | 2.62x   |
  | S 128 KiB | hit    |  13.47 |                4.95 | 2.72x   |
  | S 128 KiB | insert |  11.62 |                5.41 | 2.15x   |
  | M 2 MiB   | miss   |  18.88 |                7.47 | 2.53x   |
  | M 2 MiB   | hit    |  18.12 |                7.22 | 2.51x   |
  | M 2 MiB   | insert |  14.99 |                8.45 | 1.77x   |
  | L 32 MiB  | miss   |  27.56 |               11.07 | 2.49x   |
  | L 32 MiB  | hit    |  26.57 |               11.23 | 2.37x   |
  | L 32 MiB  | insert |  23.53 |               12.77 | 1.84x   |

Tests: the two `test_simd_*_matches_scalar` diff tests assert the
AVX2-compiled shim and the baseline-compiled scalar path produce
identical output across 10K random `(blocks, hash)` pairs each
(guarding against an autovectorizer miscompile). All 35 bloom_filter
tests pass with and without `-C target-cpu=native`.
@dmatth1
Copy link
Copy Markdown
Author

dmatth1 commented May 23, 2026

Great callout. Measured bench and the numbers with autovectorization are better:

Same-host, same-session medians (Cascade Lake-class Xeon @ 2.8 GHz), via the public Sbbf::{check,insert} API (XXH64 + probe), criterion default profile, ns/op:

Regime Path Scalar Autovec (this) Hand-written AVX2 Autovec vs scalar Autovec vs hand-written
S 128 KiB miss 13.02 4.96 5.14 2.62× +4%
S 128 KiB hit 13.47 4.95 5.20 2.72× +5%
S 128 KiB insert 11.62 5.41 5.38 2.15× tied
M 2 MiB miss 18.88 7.47 8.18 2.53× +9%
M 2 MiB hit 18.12 7.22 8.01 2.51× +11%
M 2 MiB insert 14.99 8.45 8.59 1.77× tied
L 32 MiB miss 27.56 11.07 13.47 2.49× +18%
L 32 MiB hit 26.57 11.23 13.40 2.37× +16%
L 32 MiB insert 23.53 12.77 12.60 1.84× tied

Changes here: main...dmatth1:arrow-rs:sbbf-autovec-tf
This unlocks Neon/aarch64 so I will bundle those numbers into the next revision.

dmatth1 added a commit to dmatth1/arrow-rs that referenced this pull request May 23, 2026
… shim

Alternative to the hand-written AVX2 intrinsics, per @jhorstmann's
review on apache#10011: there are no `_mm256_*` intrinsics here. The single
probe implementation lives in `Block::{check,insert}`, written in the
vectorizer-friendly shape, and a thin `#[target_feature(enable =
"avx2")]` shim (`avx2::{check,insert}_hash`) calls it. Because the
shim is compiled with AVX2 enabled, LLVM autovectorizes the plain
Rust body to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest` —
on a baseline `x86-64` build, with no downstream `target-cpu` flag.
The shim is reached only after a runtime `is_x86_feature_detected!`
check (cached on `Sbbf`); on the scalar fallback path the same source
compiles to SSE2.

Two details are load-bearing for the autovectorizer:
- `Block::check` is the branchless integer OR-accumulator
  `acc |= !block[i] & mask[i]; acc == 0` (the "testc" reduction
  shape), not a short-circuiting `.all()`. The short-circuit form
  defeats vectorization; a bool-`&=` form fails to vectorize through
  the target_feature shim on a baseline build.
- `Block::mask` is `#[inline]` so it folds into the shim and is
  vectorized with it rather than staying a scalar call.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module
scope) so the autovectorized 256-bit load/store hits one cache line.

A/B vs the scalar fallback through the public `Sbbf::{check,insert}`
API (XXH64 + probe), criterion default profile, same-session medians,
ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"`
(removed before this commit).

  x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build`
  (no `target-cpu`):

  | Regime    | Path   | Scalar | Autovec (avx2 shim) | Speedup |
  |-----------|--------|-------:|--------------------:|--------:|
  | S 128 KiB | miss   |  13.02 |                4.96 | 2.62x   |
  | S 128 KiB | hit    |  13.47 |                4.95 | 2.72x   |
  | S 128 KiB | insert |  11.62 |                5.41 | 2.15x   |
  | M 2 MiB   | miss   |  18.88 |                7.47 | 2.53x   |
  | M 2 MiB   | hit    |  18.12 |                7.22 | 2.51x   |
  | M 2 MiB   | insert |  14.99 |                8.45 | 1.77x   |
  | L 32 MiB  | miss   |  27.56 |               11.07 | 2.49x   |
  | L 32 MiB  | hit    |  26.57 |               11.23 | 2.37x   |
  | L 32 MiB  | insert |  23.53 |               12.77 | 1.84x   |

Tests: the two `test_simd_*_matches_scalar` diff tests assert the
AVX2-compiled shim and the baseline-compiled scalar path produce
identical output across 10K random `(blocks, hash)` pairs each
(guarding against an autovectorizer miscompile). All 35 bloom_filter
tests pass with and without `-C target-cpu=native`.
dmatth1 added a commit to dmatth1/arrow-rs that referenced this pull request May 23, 2026
… shim

Alternative to the hand-written AVX2 intrinsics, per @jhorstmann's
review on apache#10011: there are no `_mm256_*` intrinsics here. The single
probe implementation lives in `Block::{check,insert}`, written in the
vectorizer-friendly shape, and a thin `#[target_feature(enable =
"avx2")]` shim (`avx2::{check,insert}_hash`) calls it. Because the
shim is compiled with AVX2 enabled, LLVM autovectorizes the plain
Rust body to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest` —
on a baseline `x86-64` build, with no downstream `target-cpu` flag.
The shim is reached only after a runtime `is_x86_feature_detected!`
check (cached on `Sbbf`); on the scalar fallback path the same source
compiles to SSE2.

Two details are load-bearing for the autovectorizer:
- `Block::check` is the branchless integer OR-accumulator
  `acc |= !block[i] & mask[i]; acc == 0` (the "testc" reduction
  shape), not a short-circuiting `.all()`. The short-circuit form
  defeats vectorization; a bool-`&=` form fails to vectorize through
  the target_feature shim on a baseline build.
- `Block::mask` is `#[inline]` so it folds into the shim and is
  vectorized with it rather than staying a scalar call.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module
scope) so the autovectorized 256-bit load/store hits one cache line.

A/B vs the scalar fallback through the public `Sbbf::{check,insert}`
API (XXH64 + probe), criterion default profile, same-session medians,
ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"`
(removed before this commit).

  x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build`
  (no `target-cpu`):

  | Regime    | Path   | Scalar | Autovec (avx2 shim) | Speedup |
  |-----------|--------|-------:|--------------------:|--------:|
  | S 128 KiB | miss   |  13.02 |                4.96 | 2.62x   |
  | S 128 KiB | hit    |  13.47 |                4.95 | 2.72x   |
  | S 128 KiB | insert |  11.62 |                5.41 | 2.15x   |
  | M 2 MiB   | miss   |  18.88 |                7.47 | 2.53x   |
  | M 2 MiB   | hit    |  18.12 |                7.22 | 2.51x   |
  | M 2 MiB   | insert |  14.99 |                8.45 | 1.77x   |
  | L 32 MiB  | miss   |  27.56 |               11.07 | 2.49x   |
  | L 32 MiB  | hit    |  26.57 |               11.23 | 2.37x   |
  | L 32 MiB  | insert |  23.53 |               12.77 | 1.84x   |

Tests: the two `test_simd_*_matches_scalar` diff tests assert the
AVX2-compiled shim and the baseline-compiled scalar path produce
identical output across 10K random `(blocks, hash)` pairs each
(guarding against an autovectorizer miscompile). All 35 bloom_filter
tests pass with and without `-C target-cpu=native`.
dmatth1 added a commit to dmatth1/arrow-rs that referenced this pull request May 23, 2026
… shim

Per @jhorstmann's review on apache#10011: no `_mm256_*` intrinsics. The single
probe implementation lives in `Block::{check,insert}` and a thin
`#[target_feature(enable = "avx2")]` shim calls into it. Because the
shim is compiled with AVX2 on, LLVM autovectorizes the plain Rust body
to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest` — on a baseline
`x86-64` build, no `target-cpu` flag required. The shim is reached
after a runtime `is_x86_feature_detected!` check (cached on `Sbbf`).

Two preconditions for autovec:
- `Block::check` is the branchless `acc |= !block & mask; acc == 0`
  ("testc" reduction shape); a short-circuiting `.all()` defeats
  vectorization.
- `Block::mask` is `#[inline]` so it folds into the shim.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module
scope) so the 256-bit load/store hits one cache line.

The same branchless `Block::check` also autovectorizes to NEON on
aarch64 — no shim, no `target_feature` needed (NEON is baseline). On
main, the short-circuit form left aarch64 fully scalar.

A/B vs the scalar fallback through the public `Sbbf::{check,insert}`
API (XXH64 + probe), criterion default profile, same-session medians,
ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"`
(removed before this commit).

  x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build`:

  | Regime    | Path   | Scalar | Autovec (avx2 shim) | Speedup |
  |-----------|--------|-------:|--------------------:|--------:|
  | S 128 KiB | miss   |  13.02 |                4.96 | 2.62x   |
  | S 128 KiB | hit    |  13.47 |                4.95 | 2.72x   |
  | S 128 KiB | insert |  11.62 |                5.41 | 2.15x   |
  | M 2 MiB   | miss   |  18.88 |                7.47 | 2.53x   |
  | M 2 MiB   | hit    |  18.12 |                7.22 | 2.51x   |
  | M 2 MiB   | insert |  14.99 |                8.45 | 1.77x   |
  | L 32 MiB  | miss   |  27.56 |               11.07 | 2.49x   |
  | L 32 MiB  | hit    |  26.57 |               11.23 | 2.37x   |
  | L 32 MiB  | insert |  23.53 |               12.77 | 1.84x   |

  aarch64 — Apple Silicon M1:

  | Regime    | Path   | Scalar | Autovec (NEON) | Speedup |
  |-----------|--------|-------:|---------------:|--------:|
  | S 128 KiB | miss   |   4.61 |           3.24 | 1.42x   |
  | S 128 KiB | hit    |   6.84 |           3.17 | 2.16x   |
  | S 128 KiB | insert |   3.25 |           3.19 | 1.02x   |
  | M 2 MiB   | miss   |   5.20 |           3.24 | 1.61x   |
  | M 2 MiB   | hit    |   7.16 |           3.26 | 2.20x   |
  | M 2 MiB   | insert |   3.34 |           3.31 | 1.01x   |
  | L 32 MiB  | miss   |   6.66 |           5.42 | 1.23x   |
  | L 32 MiB  | hit    |   9.72 |           5.25 | 1.85x   |
  | L 32 MiB  | insert |   5.19 |           5.38 | 0.96x   |

  Insert is ~tied on aarch64 because main's `Block::insert` was already
  vectorizer-friendly. The PR's aarch64 win lives in `check`, where the
  branchless form unlocks NEON autovec.

Tests: `test_simd_{check,insert}_matches_scalar` diff the AVX2 shim
against the baseline-compiled scalar across 10K random pairs;
`test_check_matches_reference_aarch64` diffs the autovec'd check
against an inline short-circuit reference for the aarch64 codegen
path. All bloom_filter tests pass with and without `-C target-cpu=native`.
dmatth1 added a commit to dmatth1/arrow-rs that referenced this pull request May 23, 2026
Per @jhorstmann's review on apache#10011: no hand-written `_mm256_*` / NEON
intrinsics, no runtime dispatch, no `target_feature` shim. `Block::check`
is rewritten in the vectorizer-friendly branchless shape and LLVM
autovectorizes it directly to whatever SIMD ISA is enabled at compile
time:

- aarch64 (Apple Silicon, Graviton 2/3/4, Ampere, Cobalt): NEON is
  mandatory baseline, so the default build autovectorizes to
  `vmulq + vshrq + vshlq + vbicq + vorrq + vmaxvq`.
- x86_64 with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`):
  autovectorizes to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest`.
- Default `cargo build` on x86_64 (baseline `x86-64`, SSE2 only):
  partial SSE2 autovec — `vpsllvd` doesn't exist pre-AVX2, so the
  per-lane variable shift in the mask compute partly scalarizes.
- wasm32, RISC-V, 32-bit: whatever the toolchain's target features
  allow; falls back to scalar otherwise.

Production deployments that care about x86 SBBF perf should set
`RUSTFLAGS="-C target-cpu=x86-64-v3"` (or higher). This is already the
convention for analytical Rust binaries (Polars, DataFusion, Databend
distros). A runtime AVX2-detect shim was prototyped and rejected for
this PR — it adds `unsafe`, a per-`Sbbf` cached bool, and a dispatch
branch in the hot path, in exchange for AVX2 codegen on default-built
binaries running on AVX2 hardware. The simplification was preferred.

Two preconditions for autovec:
- `Block::check` is the branchless `acc |= !block & mask; acc == 0`
  ("testc" reduction shape); a short-circuiting `.all()` defeats
  vectorization.
- `Block::mask` is `#[inline]` so it folds into the call site.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module
scope) so the 256-bit load/store hits one cache line.

A/B vs scalar (short-circuit `Block::check`) through the public
`Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile,
same-session medians, ns/op.

  x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, built with
  `-C target-cpu=x86-64-v3`:

  | Regime    | Path   | Scalar | Autovec | Speedup |
  |-----------|--------|-------:|--------:|--------:|
  | S 128 KiB | miss   |  13.02 |    4.96 | 2.62x   |
  | S 128 KiB | hit    |  13.47 |    4.95 | 2.72x   |
  | S 128 KiB | insert |  11.62 |    5.41 | 2.15x   |
  | M 2 MiB   | miss   |  18.88 |    7.47 | 2.53x   |
  | M 2 MiB   | hit    |  18.12 |    7.22 | 2.51x   |
  | M 2 MiB   | insert |  14.99 |    8.45 | 1.77x   |
  | L 32 MiB  | miss   |  27.56 |   11.07 | 2.49x   |
  | L 32 MiB  | hit    |  26.57 |   11.23 | 2.37x   |
  | L 32 MiB  | insert |  23.53 |   12.77 | 1.84x   |

  aarch64 — Apple Silicon M1 (NEON via baseline autovec, default build):

  | Regime    | Path   | Scalar | Autovec | Speedup |
  |-----------|--------|-------:|--------:|--------:|
  | S 128 KiB | miss   |   4.61 |    3.24 | 1.42x   |
  | S 128 KiB | hit    |   6.84 |    3.17 | 2.16x   |
  | S 128 KiB | insert |   3.25 |    3.19 | 1.02x   |
  | M 2 MiB   | miss   |   5.20 |    3.24 | 1.61x   |
  | M 2 MiB   | hit    |   7.16 |    3.26 | 2.20x   |
  | M 2 MiB   | insert |   3.34 |    3.31 | 1.01x   |
  | L 32 MiB  | miss   |   6.66 |    5.42 | 1.23x   |
  | L 32 MiB  | hit    |   9.72 |    5.25 | 1.85x   |
  | L 32 MiB  | insert |   5.19 |    5.38 | 0.96x   |

  Insert is ~tied on aarch64 because main's `Block::insert` was already
  vectorizer-friendly. The PR's aarch64 win lives in `check`, where the
  branchless form unlocks NEON autovec.

Tests: `test_check_matches_reference` diffs the autovec'd `Block::check`
against an inline short-circuit reference across 10K random pairs on
every target the crate is built for. All bloom_filter tests pass.
dmatth1 added a commit to dmatth1/arrow-rs that referenced this pull request May 23, 2026
Per @jhorstmann's review on apache#10011: no hand-written `_mm256_*` / NEON
intrinsics, no runtime dispatch, no `target_feature` shim. `Block::check`
is rewritten in the vectorizer-friendly branchless shape and LLVM
autovectorizes it directly to whatever SIMD ISA is enabled at compile
time:

- aarch64 (Apple Silicon, Graviton 2/3/4, Ampere, Cobalt): NEON is
  mandatory baseline, so the default build autovectorizes to
  `vmulq + vshrq + vshlq + vbicq + vorrq + vmaxvq`.
- x86_64 with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`):
  autovectorizes to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest`.
- Default `cargo build` on x86_64 (baseline `x86-64`, SSE2 only):
  partial SSE2 autovec — `vpsllvd` doesn't exist pre-AVX2, so the
  per-lane variable shift in the mask compute partly scalarizes.
- wasm32, RISC-V, 32-bit: whatever the toolchain's target features
  allow; falls back to scalar otherwise.

Production deployments that care about x86 SBBF perf should set
`RUSTFLAGS="-C target-cpu=x86-64-v3"` (or higher). This is already the
convention for analytical Rust binaries (Polars, DataFusion, Databend
distros). A runtime AVX2-detect shim was prototyped and rejected for
this PR — it adds `unsafe`, a per-`Sbbf` cached bool, and a dispatch
branch in the hot path, in exchange for AVX2 codegen on default-built
binaries running on AVX2 hardware. The simplification was preferred.

Two preconditions for autovec:
- `Block::check` is the branchless `acc |= !block & mask; acc == 0`
  ("testc" reduction shape); a short-circuiting `.all()` defeats
  vectorization.
- `Block::mask` is `#[inline]` so it folds into the call site.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module
scope) so the 256-bit load/store hits one cache line.

A/B vs scalar (short-circuit `Block::check`) through the public
`Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile,
same-session medians, ns/op.

  x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, built with
  `-C target-cpu=x86-64-v3`:

  | Regime    | Path   | Scalar | Autovec | Speedup |
  |-----------|--------|-------:|--------:|--------:|
  | S 128 KiB | miss   |  13.02 |    4.96 | 2.62x   |
  | S 128 KiB | hit    |  13.47 |    4.95 | 2.72x   |
  | S 128 KiB | insert |  11.62 |    5.41 | 2.15x   |
  | M 2 MiB   | miss   |  18.88 |    7.47 | 2.53x   |
  | M 2 MiB   | hit    |  18.12 |    7.22 | 2.51x   |
  | M 2 MiB   | insert |  14.99 |    8.45 | 1.77x   |
  | L 32 MiB  | miss   |  27.56 |   11.07 | 2.49x   |
  | L 32 MiB  | hit    |  26.57 |   11.23 | 2.37x   |
  | L 32 MiB  | insert |  23.53 |   12.77 | 1.84x   |

  aarch64 — Apple Silicon M1 (NEON via baseline autovec, default build):

  | Regime    | Path   | Scalar | Autovec | Speedup |
  |-----------|--------|-------:|--------:|--------:|
  | S 128 KiB | miss   |   4.61 |    3.24 | 1.42x   |
  | S 128 KiB | hit    |   6.84 |    3.17 | 2.16x   |
  | S 128 KiB | insert |   3.25 |    3.19 | 1.02x   |
  | M 2 MiB   | miss   |   5.20 |    3.24 | 1.61x   |
  | M 2 MiB   | hit    |   7.16 |    3.26 | 2.20x   |
  | M 2 MiB   | insert |   3.34 |    3.31 | 1.01x   |
  | L 32 MiB  | miss   |   6.66 |    5.42 | 1.23x   |
  | L 32 MiB  | hit    |   9.72 |    5.25 | 1.85x   |
  | L 32 MiB  | insert |   5.19 |    5.38 | 0.96x   |

  Insert is ~tied on aarch64 because main's `Block::insert` was already
  vectorizer-friendly. The PR's aarch64 win lives in `check`, where the
  branchless form unlocks NEON autovec.

Tests: `test_check_matches_reference` diffs the autovec'd `Block::check`
against an inline short-circuit reference across 10K random pairs on
every target the crate is built for. All bloom_filter tests pass.
dmatth1 added a commit to dmatth1/arrow-rs that referenced this pull request May 23, 2026
Per @jhorstmann's review on apache#10011: no hand-written intrinsics, no
target_feature shim, no runtime dispatch. `Block::check` is rewritten
as the branchless `acc |= !block & mask; acc == 0` ("testc" reduction
shape) and LLVM autovectorizes it directly to NEON on aarch64 and to
AVX2 on x86_64 built with `-C target-cpu=x86-64-v3` (or `=native`,
or `+avx2`). A runtime AVX2-detect shim was prototyped and rejected:
the simplification (no `unsafe`, no `Sbbf` field, no hot-path branch)
beat the only thing it bought, which was AVX2 codegen for default-
built binaries on AVX2 hardware — production deployments that care
already set the target-cpu flag.

Preconditions: `Block::mask` is `#[inline]` (folds into the call
site) and `Block` is `#[repr(C, align(32))]` with size/align
asserted (so the 256-bit load/store hits one cache line).

A/B vs scalar (short-circuit `Block::check`) through the public
`Sbbf::{check,insert}` API (XXH64 + probe), criterion default
profile, same-session medians, ns/op.

  x86_64 — Cascade Lake-class Xeon @ 2.8 GHz,
  `-C target-cpu=x86-64-v3`:

  | Regime    | Path   | Scalar | Autovec | Speedup |
  |-----------|--------|-------:|--------:|--------:|
  | S 128 KiB | miss   |  13.02 |    4.96 | 2.62x   |
  | S 128 KiB | hit    |  13.47 |    4.95 | 2.72x   |
  | S 128 KiB | insert |  11.62 |    5.41 | 2.15x   |
  | M 2 MiB   | miss   |  18.88 |    7.47 | 2.53x   |
  | M 2 MiB   | hit    |  18.12 |    7.22 | 2.51x   |
  | M 2 MiB   | insert |  14.99 |    8.45 | 1.77x   |
  | L 32 MiB  | miss   |  27.56 |   11.07 | 2.49x   |
  | L 32 MiB  | hit    |  26.57 |   11.23 | 2.37x   |
  | L 32 MiB  | insert |  23.53 |   12.77 | 1.84x   |

  aarch64 — Apple Silicon M1 (NEON via baseline autovec):

  | Regime    | Path   | Scalar | Autovec | Speedup |
  |-----------|--------|-------:|--------:|--------:|
  | S 128 KiB | miss   |   4.61 |    3.24 | 1.42x   |
  | S 128 KiB | hit    |   6.84 |    3.17 | 2.16x   |
  | S 128 KiB | insert |   3.25 |    3.19 | 1.02x   |
  | M 2 MiB   | miss   |   5.20 |    3.24 | 1.61x   |
  | M 2 MiB   | hit    |   7.16 |    3.26 | 2.20x   |
  | M 2 MiB   | insert |   3.34 |    3.31 | 1.01x   |
  | L 32 MiB  | miss   |   6.66 |    5.42 | 1.23x   |
  | L 32 MiB  | hit    |   9.72 |    5.25 | 1.85x   |
  | L 32 MiB  | insert |   5.19 |    5.38 | 0.96x   |

  Insert ties on aarch64 because main's `Block::insert` was already
  vectorizer-friendly. The PR's aarch64 win lives in `check`.

Tests: `test_check_matches_reference` diffs the autovec'd
`Block::check` against an inline short-circuit reference across 10K
random pairs, on every target. All bloom_filter tests pass.
dmatth1 added a commit to dmatth1/arrow-rs that referenced this pull request May 23, 2026
`Sbbf::{check,insert}` are on the hot path of Parquet row-group
skipping for every reader downstream of `arrow-rs` (DataFusion,
Databend, InfluxDB / IOx, RisingWave, GreptimeDB). Each 256-bit
Parquet block is exactly one AVX2 vector / two NEON `uint32x4_t`
halves; the K=8 lane test is a one-instruction `vptest` on AVX2 and
an equivalent SIMD reduce on NEON. This PR vectorises the probe
without changing the algorithm, hash, salts, or wire format.

Per @jhorstmann's review on apache#10011: no hand-written intrinsics, no
target_feature shim, no runtime dispatch. `Block::check` is rewritten
as the branchless `acc |= !block & mask; acc == 0` ("testc" reduction
shape) and LLVM autovectorizes it directly to NEON on aarch64 and to
AVX2 on x86_64 built with `-C target-cpu=x86-64-v3` (or `=native`,
or `+avx2`). A runtime AVX2-detect shim was prototyped and rejected:
the simplification (no `unsafe`, no `Sbbf` field, no hot-path branch)
beat the only thing it bought, which was AVX2 codegen for default-
built binaries on AVX2 hardware — production deployments that care
already set the target-cpu flag.

Preconditions: `Block::mask` is `#[inline]` (folds into the call
site) and `Block` is `#[repr(C, align(32))]` with size/align
asserted (so the 256-bit load/store hits one cache line).

A/B vs scalar (short-circuit `Block::check`) through the public
`Sbbf::{check,insert}` API (XXH64 + probe), criterion default
profile, same-session medians, ns/op.

  x86_64 — Cascade Lake-class Xeon @ 2.8 GHz,
  `-C target-cpu=x86-64-v3`:

  | Regime    | Path   | Scalar | Autovec | Speedup |
  |-----------|--------|-------:|--------:|--------:|
  | S 128 KiB | miss   |  13.02 |    4.96 | 2.62x   |
  | S 128 KiB | hit    |  13.47 |    4.95 | 2.72x   |
  | S 128 KiB | insert |  11.62 |    5.41 | 2.15x   |
  | M 2 MiB   | miss   |  18.88 |    7.47 | 2.53x   |
  | M 2 MiB   | hit    |  18.12 |    7.22 | 2.51x   |
  | M 2 MiB   | insert |  14.99 |    8.45 | 1.77x   |
  | L 32 MiB  | miss   |  27.56 |   11.07 | 2.49x   |
  | L 32 MiB  | hit    |  26.57 |   11.23 | 2.37x   |
  | L 32 MiB  | insert |  23.53 |   12.77 | 1.84x   |

  aarch64 — Apple Silicon M1 (NEON via baseline autovec):

  | Regime    | Path   | Scalar | Autovec | Speedup |
  |-----------|--------|-------:|--------:|--------:|
  | S 128 KiB | miss   |   4.61 |    3.24 | 1.42x   |
  | S 128 KiB | hit    |   6.84 |    3.17 | 2.16x   |
  | S 128 KiB | insert |   3.25 |    3.19 | 1.02x   |
  | M 2 MiB   | miss   |   5.20 |    3.24 | 1.61x   |
  | M 2 MiB   | hit    |   7.16 |    3.26 | 2.20x   |
  | M 2 MiB   | insert |   3.34 |    3.31 | 1.01x   |
  | L 32 MiB  | miss   |   6.66 |    5.42 | 1.23x   |
  | L 32 MiB  | hit    |   9.72 |    5.25 | 1.85x   |
  | L 32 MiB  | insert |   5.19 |    5.38 | 0.96x   |

  Insert ties on aarch64 because main's `Block::insert` was already
  vectorizer-friendly. The PR's aarch64 win lives in `check`.

Tests: `test_check_matches_reference` diffs the autovec'd
`Block::check` against an inline short-circuit reference across 10K
random pairs, on every target. All bloom_filter tests pass.
dmatth1 added a commit to dmatth1/arrow-rs that referenced this pull request May 23, 2026
Each 256-bit Parquet block is exactly one AVX2 vector; the K=8 lane
test collapses to one `vptest` (`_mm256_testc_si256`). This PR
vectorises that loop without changing the algorithm, hash, salts, or
wire format.

Per @jhorstmann's review on apache#10011: `Block::check` is rewritten in the
vectorizer-friendly branchless shape and LLVM autovectorizes it
directly to whatever SIMD ISA is enabled at compile time:

- aarch64 (Apple Silicon, Graviton 2/3/4, Ampere, Cobalt): NEON is
  mandatory baseline, so the default build autovectorizes to
  `vmulq + vshrq + vshlq + vbicq + vorrq + vmaxvq`.
- x86_64 with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`):
  autovectorizes to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest`.
- Default `cargo build` on x86_64 (baseline `x86-64`, SSE2 only):
  partial SSE2 autovec — `vpsllvd` doesn't exist pre-AVX2, so the
  per-lane variable shift in the mask compute partly scalarizes.
- wasm32, RISC-V, 32-bit: whatever the toolchain's target features
  allow; falls back to scalar otherwise.

Production deployments that care about x86 SBBF perf should set
`RUSTFLAGS="-C target-cpu=x86-64-v3"` (or higher). A runtime
AVX2-detect shim was prototyped but I prefer this simplification.

Two preconditions for autovec:
- `Block::check` is the branchless `acc |= !block & mask; acc == 0`
  ("testc" reduction shape); a short-circuiting `.all()` defeats
  vectorization.
- `Block::mask` is `#[inline]` so it folds into the call site.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module
scope) so the 256-bit load/store hits one cache line.

A/B vs scalar (short-circuit `Block::check`) through the public
`Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile,
same-session medians, ns/op.

  x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, built with
  `-C target-cpu=x86-64-v3`:

  | Regime    | Path   | Scalar | Autovec | Speedup |
  |-----------|--------|-------:|--------:|--------:|
  | S 128 KiB | miss   |  13.02 |    4.96 | 2.62x   |
  | S 128 KiB | hit    |  13.47 |    4.95 | 2.72x   |
  | S 128 KiB | insert |  11.62 |    5.41 | 2.15x   |
  | M 2 MiB   | miss   |  18.88 |    7.47 | 2.53x   |
  | M 2 MiB   | hit    |  18.12 |    7.22 | 2.51x   |
  | M 2 MiB   | insert |  14.99 |    8.45 | 1.77x   |
  | L 32 MiB  | miss   |  27.56 |   11.07 | 2.49x   |
  | L 32 MiB  | hit    |  26.57 |   11.23 | 2.37x   |
  | L 32 MiB  | insert |  23.53 |   12.77 | 1.84x   |

  aarch64 — Apple Silicon M1 (NEON via baseline autovec, default build):

  | Regime    | Path   | Scalar | Autovec | Speedup |
  |-----------|--------|-------:|--------:|--------:|
  | S 128 KiB | miss   |   4.61 |    3.24 | 1.42x   |
  | S 128 KiB | hit    |   6.84 |    3.17 | 2.16x   |
  | S 128 KiB | insert |   3.25 |    3.19 | 1.02x   |
  | M 2 MiB   | miss   |   5.20 |    3.24 | 1.61x   |
  | M 2 MiB   | hit    |   7.16 |    3.26 | 2.20x   |
  | M 2 MiB   | insert |   3.34 |    3.31 | 1.01x   |
  | L 32 MiB  | miss   |   6.66 |    5.42 | 1.23x   |
  | L 32 MiB  | hit    |   9.72 |    5.25 | 1.85x   |
  | L 32 MiB  | insert |   5.19 |    5.38 | 0.96x   |

  Insert is ~tied on aarch64 because main's `Block::insert` was already
  vectorizer-friendly. The PR's aarch64 win lives in `check`, where the
  branchless form unlocks NEON autovec.

Tests: `test_check_matches_reference` diffs the autovec'd `Block::check`
against an inline short-circuit reference across 10K random pairs on
every target the crate is built for. All bloom_filter tests pass.
Each 256-bit Parquet block is exactly one AVX2 vector; the K=8 lane
test collapses to one `vptest` (`_mm256_testc_si256`). This PR
vectorises that loop without changing the algorithm, hash, salts, or
wire format.

Per @jhorstmann's review on apache#10011: `Block::check` is rewritten in the
vectorizer-friendly branchless shape and LLVM autovectorizes it
directly to whatever SIMD ISA is enabled at compile time:

- aarch64 (Apple Silicon, Graviton 2/3/4, Ampere, Cobalt): NEON is
  mandatory baseline, so the default build autovectorizes to
  `vmulq + vshrq + vshlq + vbicq + vorrq + vmaxvq`.
- x86_64 with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`):
  autovectorizes to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest`.
- Default `cargo build` on x86_64 (baseline `x86-64`, SSE2 only):
  partial SSE2 autovec — `vpsllvd` doesn't exist pre-AVX2, so the
  per-lane variable shift in the mask compute partly scalarizes.
- wasm32, RISC-V, 32-bit: whatever the toolchain's target features
  allow; falls back to scalar otherwise.

Production deployments that care about x86 SBBF perf should set
`RUSTFLAGS="-C target-cpu=x86-64-v3"` (or higher). A runtime
AVX2-detect shim was prototyped but I prefer this simplification.

Two preconditions for autovec:
- `Block::check` is the branchless `acc |= !block & mask; acc == 0`
  ("testc" reduction shape); a short-circuiting `.all()` defeats
  vectorization.
- `Block::mask` is `#[inline]` so it folds into the call site.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module
scope) so the 256-bit load/store hits one cache line.

A/B vs scalar (short-circuit `Block::check`) through the public
`Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile,
same-session medians, ns/op.

  x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, built with
  `-C target-cpu=x86-64-v3`:

  | Regime    | Path   | Scalar | Autovec | Speedup |
  |-----------|--------|-------:|--------:|--------:|
  | S 128 KiB | miss   |  13.02 |    4.96 | 2.62x   |
  | S 128 KiB | hit    |  13.47 |    4.95 | 2.72x   |
  | S 128 KiB | insert |  11.62 |    5.41 | 2.15x   |
  | M 2 MiB   | miss   |  18.88 |    7.47 | 2.53x   |
  | M 2 MiB   | hit    |  18.12 |    7.22 | 2.51x   |
  | M 2 MiB   | insert |  14.99 |    8.45 | 1.77x   |
  | L 32 MiB  | miss   |  27.56 |   11.07 | 2.49x   |
  | L 32 MiB  | hit    |  26.57 |   11.23 | 2.37x   |
  | L 32 MiB  | insert |  23.53 |   12.77 | 1.84x   |

  aarch64 — Apple Silicon M1 (NEON via baseline autovec, default build):

  | Regime    | Path   | Scalar | Autovec | Speedup |
  |-----------|--------|-------:|--------:|--------:|
  | S 128 KiB | miss   |   4.61 |    3.24 | 1.42x   |
  | S 128 KiB | hit    |   6.84 |    3.17 | 2.16x   |
  | S 128 KiB | insert |   3.25 |    3.19 | 1.02x   |
  | M 2 MiB   | miss   |   5.20 |    3.24 | 1.61x   |
  | M 2 MiB   | hit    |   7.16 |    3.26 | 2.20x   |
  | M 2 MiB   | insert |   3.34 |    3.31 | 1.01x   |
  | L 32 MiB  | miss   |   6.66 |    5.42 | 1.23x   |
  | L 32 MiB  | hit    |   9.72 |    5.25 | 1.85x   |
  | L 32 MiB  | insert |   5.19 |    5.38 | 0.96x   |

  Insert is ~tied on aarch64 because main's `Block::insert` was already
  vectorizer-friendly. The PR's aarch64 win lives in `check`, where the
  branchless form unlocks NEON autovec.

Tests: `test_check_matches_reference` diffs the autovec'd `Block::check`
against an inline short-circuit reference across 10K random pairs on
every target the crate is built for. All bloom_filter tests pass.
@dmatth1
Copy link
Copy Markdown
Author

dmatth1 commented May 23, 2026

Tested locally on aarch64 too (Apple Silicon M1, baseline NEON autovec):

Regime Path Scalar Autovec Speedup
S 128 KiB miss 4.61 3.24 1.42x
S 128 KiB hit 6.84 3.17 2.16x
S 128 KiB insert 3.25 3.19 1.02x
M 2 MiB miss 5.20 3.24 1.61x
M 2 MiB hit 7.16 3.26 2.20x
M 2 MiB insert 3.34 3.31 1.01x
L 32 MiB miss 6.66 5.42 1.23x
L 32 MiB hit 9.72 5.25 1.85x
L 32 MiB insert 5.19 5.38 0.96x

Big simplifier. I included details about how autovec reduces/lowers instructions in the new commit message. Going to force-push to use this approach.

One thing beyond your suggestion: I prototyped a runtime AVX2-detect shim and dropped it for the simplification (no unsafe, no Sbbf field, no hot-path branch) since users who care about AVX2 probably already set -C target-cpu=....

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants