Optimizations/simplifications #7

Merged: kahrendt merged 11 commits into main from flac-optimizations on May 4, 2026


@kahrendt kahrendt commented May 4, 2026

Optimizes the FLAC decoder. Decode performance improves by around 15% on an ESP32-S3 and by about 10% on an ESP32-P4.

  • Adds optimized output path for 4-byte aligned buffers when outputting packed 24-bit samples

  • Simplifies the code in a few spots like removing unnecessary/redundant conditional checks and brackets

  • Unrolls the CRC loop by 8 on 32-bit platforms to amortize the loop increment cost (this isn't slicing by 8 as the 64-bit path uses; it just uses the one table 8 times per loop iteration)

  • Adds several compiler macros: FLAC_NOINLINE, FLAC_ASSUME, and FLAC_NO_SANITIZE_OVERFLOW

    • FLAC_NOINLINE and FLAC_ASSUME let us optimize hot loops better
    • FLAC_NO_SANITIZE_OVERFLOW will be useful for a potential fuzzer: it disables UBSan overflow checks on the LPC restoration path, where overflow is completely unavoidable
  • Adds a local bit reader helper. This lets the compiler keep the bit buffer in registers rather than constantly storing and reloading the member variable

  • Adds a decode_rice_partition helper that reduces register pressure and helps the compiler avoid using the stack for intermediate storage

  • Updates the internal documentation to describe the new bit reader helper header

  • Updates the benchmark numbers based on these optimizations

    • Adds a separate benchmark for ESP32 in internal memory
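
The CRC bullet above distinguishes unrolling with a single table from slicing-by-8. A minimal sketch of the single-table variant, assuming the standard FLAC CRC-16 parameters (polynomial 0x8005, initial value 0, MSB-first). The function names are illustrative and are not taken from src/crc.cpp:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Build the single 256-entry table (assumed parameters: poly 0x8005, MSB-first).
static std::array<uint16_t, 256> make_crc16_table() {
    std::array<uint16_t, 256> table{};
    for (uint32_t i = 0; i < 256; ++i) {
        uint32_t crc = i << 8;
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc & 0x8000u) ? ((crc << 1) ^ 0x8005u) : (crc << 1);
        }
        table[i] = static_cast<uint16_t>(crc);
    }
    return table;
}

static const std::array<uint16_t, 256> kCrc16Table = make_crc16_table();

// One lookup per byte. The CRC is held in a 32-bit local and masked to 16 bits.
static inline uint32_t crc16_step(uint32_t crc, uint8_t byte) {
    return ((crc << 8) ^ kCrc16Table[((crc >> 8) ^ byte) & 0xFFu]) & 0xFFFFu;
}

// Reference: one byte per loop iteration.
uint16_t crc16_bytewise(const uint8_t* data, size_t len, uint16_t crc_in = 0) {
    uint32_t crc = crc_in;
    for (size_t i = 0; i < len; ++i) {
        crc = crc16_step(crc, data[i]);
    }
    return static_cast<uint16_t>(crc);
}

// Unrolled: the same table is used 8 times per iteration, so the loop counter
// and branch are paid once per 8 bytes. No extra tables are needed.
uint16_t crc16_unrolled8(const uint8_t* data, size_t len, uint16_t crc_in = 0) {
    uint32_t crc = crc_in;
    while (len >= 8) {
        crc = crc16_step(crc, data[0]);
        crc = crc16_step(crc, data[1]);
        crc = crc16_step(crc, data[2]);
        crc = crc16_step(crc, data[3]);
        crc = crc16_step(crc, data[4]);
        crc = crc16_step(crc, data[5]);
        crc = crc16_step(crc, data[6]);
        crc = crc16_step(crc, data[7]);
        data += 8;
        len -= 8;
    }
    while (len > 0) {  // remaining 0..7 bytes
        crc = crc16_step(crc, *data++);
        --len;
    }
    return static_cast<uint16_t>(crc);
}
```

Slicing-by-8 would instead combine eight different tables per step, which is the extra flash and cache pressure the commit note refers to.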

kahrendt added 10 commits May 4, 2026 08:45
Note we do not slice by 8 which requires more tables with larger flash/cache pressure, but we just use the one table up to 8 times in one call to reduce the loop overhead.
…SANITIZE_OVERFLOW.

FLAC_NOINLINE and FLAC_ASSUME let us optimize hot loops better. FLAC_NO_SANITIZE_OVERFLOW will be useful for a potential fuzzer: it disables UBSan overflow checks on the LPC restoration path, where overflow is completely unavoidable.
… a future fuzzer and remove some unnecessary brackets
…ion helper method (never inlined) to reduce register pressure.
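
The actual definitions in src/compiler.h are not shown in this conversation. As a rough sketch of how macros of this kind are commonly written for GCC and Clang, the following uses hypothetical `SKETCH_` names as stand-ins, and `lpc_predict_sketch` is likewise an invented example function rather than code from src/lpc.cpp:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins; the PR's real macros may be defined differently.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_NOINLINE __attribute__((noinline))
#define SKETCH_NO_SANITIZE_OVERFLOW \
    __attribute__((no_sanitize("signed-integer-overflow")))
#else
#define SKETCH_NOINLINE
#define SKETCH_NO_SANITIZE_OVERFLOW
#endif

// Promise the optimizer that `cond` holds. Behavior is undefined if it does not.
#if defined(__clang__)
#define SKETCH_ASSUME(cond) __builtin_assume(cond)
#elif defined(__GNUC__)
#define SKETCH_ASSUME(cond)                    \
    do {                                       \
        if (!(cond)) __builtin_unreachable();  \
    } while (0)
#else
#define SKETCH_ASSUME(cond) ((void)0)
#endif

// Predict one sample from the `order` previous samples.
// coefs[0] applies to the most recent sample, history[order - 1].
// The attribute only removes sanitizer instrumentation from this function;
// this sketch accumulates in 64 bits so its own arithmetic stays well defined.
SKETCH_NOINLINE SKETCH_NO_SANITIZE_OVERFLOW
int32_t lpc_predict_sketch(const int32_t* history, const int32_t* coefs,
                           int order, int shift) {
    SKETCH_ASSUME(order >= 1 && order <= 32);  // FLAC LPC orders are 1..32
    int64_t sum = 0;
    for (int i = 0; i < order; ++i) {
        sum += static_cast<int64_t>(coefs[i]) * history[order - 1 - i];
    }
    return static_cast<int32_t>(sum >> shift);
}
```

Keeping a helper out of line gives it its own register allocation, and the assumption lets the compiler drop handling for loop counts that cannot occur.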
@kahrendt kahrendt requested a review from Copilot May 4, 2026 13:39
@kahrendt kahrendt added the minor Bumps minor version label May 4, 2026

Copilot AI left a comment


Pull request overview

This PR focuses on improving FLAC decode throughput (especially on ESP32-class targets) by tightening hot-path bit reading, reducing register pressure in Rice residual decoding, and adding a faster PCM packing path for aligned 24-bit stereo output.

Changes:

  • Introduces a header-only local bit-reader (BitReaderLocal + helpers) and refactors residual Rice decoding into a non-inlined decode_rice_partition() loop.
  • Adds a 4-byte-aligned fast path for packed 24-bit stereo PCM output.
  • Optimizes CRC16 update behavior (including a 32-bit-host unroll path) and updates internal/public benchmark documentation.
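
To illustrate the aligned 24-bit fast path mentioned in the list above: four packed 24-bit samples (two stereo frames) occupy exactly 12 bytes, or three 32-bit words. The sketch below assumes little-endian packed output and uses invented function names; it is not the code in src/pcm_packing.cpp:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Portable path: three byte stores per sample, valid for any output alignment.
static void pack24_bytes(const int32_t* in, size_t count, uint8_t* out) {
    for (size_t i = 0; i < count; ++i) {
        const uint32_t u = static_cast<uint32_t>(in[i]);
        out[3 * i + 0] = static_cast<uint8_t>(u);
        out[3 * i + 1] = static_cast<uint8_t>(u >> 8);
        out[3 * i + 2] = static_cast<uint8_t>(u >> 16);
    }
}

// Fast path: combine four samples into three words and store them together.
// Assumes `out` is 4-byte aligned and the host is little-endian.
static void pack24_words(const int32_t* in, size_t count, uint8_t* out) {
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        const uint32_t a = static_cast<uint32_t>(in[i + 0]) & 0xFFFFFFu;
        const uint32_t b = static_cast<uint32_t>(in[i + 1]) & 0xFFFFFFu;
        const uint32_t c = static_cast<uint32_t>(in[i + 2]) & 0xFFFFFFu;
        const uint32_t d = static_cast<uint32_t>(in[i + 3]) & 0xFFFFFFu;
        const uint32_t words[3] = {
            a | (b << 24),         // a0 a1 a2 b0
            (b >> 8) | (c << 16),  // b1 b2 c0 c1
            (c >> 16) | (d << 8),  // c2 d0 d1 d2
        };
        std::memcpy(out + 3 * i, words, sizeof(words));  // 12 bytes keep alignment
    }
    pack24_bytes(in + i, count - i, out + 3 * i);  // 0..3 leftover samples
}

static bool host_is_little_endian() {
    const uint32_t probe = 1;
    uint8_t first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Dispatch: take the word path only when its assumptions hold.
void pack24(const int32_t* in, size_t count, uint8_t* out) {
    const bool aligned = (reinterpret_cast<uintptr_t>(out) & 3u) == 0;
    if (aligned && host_is_little_endian()) {
        pack24_words(in, count, out);
    } else {
        pack24_bytes(in, count, out);
    }
}
```

Because every group of four samples advances the output by 12 bytes, an output pointer that starts 4-byte aligned stays aligned for the whole buffer.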

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/README.md | Updates internal architecture docs to reflect new bit-reader helper header and behavior. |
| src/pcm_packing.cpp | Adds an aligned 24-bit stereo packing fast path and dispatch logic. |
| src/lpc.cpp | Applies FLAC_NO_SANITIZE_OVERFLOW to LPC restore paths for fuzzing/UBSan scenarios. |
| src/frame_header.cpp | Removes redundant reserved-bit check and documents that it’s validated earlier. |
| src/flac_decoder.cpp | Switches to bit_reader.h primitives, adds decode_rice_partition() to reduce register pressure. |
| src/decorrelation.cpp | Simplifies channel assignment branching for joint-stereo decorrelation. |
| src/crc.cpp | Refactors CRC16 update loop, adds 32-bit unrolled processing and keeps CRC in a 32-bit local. |
| src/compiler.h | Adds FLAC_NOINLINE, FLAC_ASSUME, and FLAC_NO_SANITIZE_OVERFLOW macros. |
| src/bit_reader.h | New header-only bit-reader implementation used by the decoder hot paths. |
| README.md | Updates headline performance table and notes about PSRAM vs internal SRAM impacts. |
| include/micro_flac/flac_decoder.h | Updates private decoder internals API (replaces inline Rice helper with decode_rice_partition). |
| examples/decode_benchmark/README.md | Refreshes benchmark numbers and adds ESP32 internal/PSRAM comparison guidance. |
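
For the bit-reader and decode_rice_partition entries above, the general idea can be sketched as follows: copy the reader state into a local whose address never leaves the function, run the Rice loop on that copy, and write it back once. The type and function names here are hypothetical and do not reproduce src/bit_reader.h, and the bit-at-a-time unary read is chosen for brevity, not speed:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical reader state; in a decoder this would be a member variable.
struct BitState {
    const uint8_t* data;
    size_t size;
    size_t pos;       // next byte to load
    uint64_t buffer;  // unread bits sit in the low `bits` bits (MSB-first stream)
    uint32_t bits;    // number of valid bits in `buffer`
    bool overrun;     // set when a read runs past the end of the input

    // Read n bits (0..32). Returns 0 and sets `overrun` if input is exhausted.
    uint32_t read_bits(uint32_t n) {
        while (bits < n) {
            if (pos >= size) {
                overrun = true;
                return 0;
            }
            buffer = (buffer << 8) | data[pos++];
            bits += 8;
        }
        bits -= n;
        return static_cast<uint32_t>((buffer >> bits) & ((uint64_t{1} << n) - 1));
    }

    // Count zero bits up to the terminating one bit.
    uint32_t read_unary() {
        uint32_t zeros = 0;
        for (;;) {
            const uint32_t bit = read_bits(1);
            if (overrun || bit == 1) return zeros;
            ++zeros;
        }
    }
};

// Decode `count` Rice-coded residuals with parameter `k` into `out`.
// Returns false if the input ran out before all values were read.
bool decode_rice_sketch(BitState& state, uint32_t k, int32_t* out, size_t count) {
    BitState local = state;  // working copy the compiler can keep in registers
    for (size_t i = 0; i < count; ++i) {
        const uint32_t q = local.read_unary();
        const uint32_t u = (q << k) | local.read_bits(k);
        // Undo the zigzag folding: even -> u / 2, odd -> -(u + 1) / 2.
        out[i] = static_cast<int32_t>(u >> 1) ^ -static_cast<int32_t>(u & 1);
    }
    state = local;  // single write-back instead of one per bit read
    return !local.overrun;
}
```

Whether the copy actually stays in registers depends on the compiler and target, which is presumably why the PR reports measured numbers per chip.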


Comment thread src/crc.cpp Outdated
Comment thread src/pcm_packing.cpp Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@kahrendt kahrendt merged commit 8134d57 into main May 4, 2026
16 checks passed
@kahrendt kahrendt deleted the flac-optimizations branch May 4, 2026 14:01
