Optimizations/simplifications #7
Merged
Note that we do not slice by 8, which would require more tables and add flash/cache pressure; we just use the one table up to 8 times in one call to reduce the loop overhead.
…SANITIZE_OVERFLOW. FLAC_NOINLINE and FLAC_ASSUME let us optimize hot loops better. FLAC_NO_SANITIZE_OVERFLOW will be useful for a potential fuzzer: it disables UBSan checks on the LPC restoration path, where the overflows they would flag are completely unavoidable.
… a future fuzzer and remove some unnecessary brackets
…ion helper method (never inlined) to reduce register pressure.
Pull request overview
This PR focuses on improving FLAC decode throughput (especially on ESP32-class targets) by tightening hot-path bit reading, reducing register pressure in Rice residual decoding, and adding a faster PCM packing path for aligned 24-bit stereo output.
Changes:
- Introduces a header-only local bit-reader (`BitReaderLocal` + helpers) and refactors residual Rice decoding into an out-of-lined `decode_rice_partition()` loop.
- Adds a 4-byte-aligned fast path for packed 24-bit stereo PCM output.
- Optimizes CRC16 update behavior (including a 32-bit-host unroll path) and updates internal/public benchmark documentation.
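The CRC16 unroll mentioned above can be illustrated with a minimal sketch. It assumes the FLAC frame CRC-16 parameters (polynomial 0x8005, initial value 0, MSB-first); the function and table names are hypothetical and are not taken from `src/crc.cpp`. The idea is a single 256-entry table used up to 8 times per loop iteration, with the running CRC held in a 32-bit local.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Build the single 256-entry table for polynomial 0x8005 (MSB-first).
static std::array<uint16_t, 256> make_crc16_table() {
    std::array<uint16_t, 256> table{};
    for (uint32_t i = 0; i < 256; ++i) {
        uint32_t crc = i << 8;
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc & 0x8000u) ? ((crc << 1) ^ 0x8005u) : (crc << 1);
        }
        table[i] = static_cast<uint16_t>(crc);  // keep the low 16 bits only
    }
    return table;
}

static const std::array<uint16_t, 256> kCrc16Table = make_crc16_table();

// One table lookup per byte; the CRC is carried in a 32-bit value and masked
// back to 16 bits after every step.
static inline uint32_t crc16_step(uint32_t crc, uint8_t byte) {
    return ((crc << 8) ^ kCrc16Table[((crc >> 8) ^ byte) & 0xFFu]) & 0xFFFFu;
}

// Reference version: one byte per loop iteration.
uint16_t crc16_bytewise(const uint8_t* data, size_t len, uint16_t crc_in = 0) {
    uint32_t crc = crc_in;
    for (size_t i = 0; i < len; ++i) {
        crc = crc16_step(crc, data[i]);
    }
    return static_cast<uint16_t>(crc);
}

// Unrolled version: same table, eight steps per iteration, so the loop
// counter and branch are paid once per 8 bytes. This is not slicing-by-8.
uint16_t crc16_unrolled(const uint8_t* data, size_t len, uint16_t crc_in = 0) {
    uint32_t crc = crc_in;
    while (len >= 8) {
        crc = crc16_step(crc, data[0]);
        crc = crc16_step(crc, data[1]);
        crc = crc16_step(crc, data[2]);
        crc = crc16_step(crc, data[3]);
        crc = crc16_step(crc, data[4]);
        crc = crc16_step(crc, data[5]);
        crc = crc16_step(crc, data[6]);
        crc = crc16_step(crc, data[7]);
        data += 8;
        len -= 8;
    }
    while (len-- > 0) {  // tail of fewer than 8 bytes
        crc = crc16_step(crc, *data++);
    }
    return static_cast<uint16_t>(crc);
}
```

Both functions should return the same value for any input; the unrolled one only changes how often the loop bookkeeping runs, not the table footprint.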
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/README.md | Updates internal architecture docs to reflect new bit-reader helper header and behavior. |
| src/pcm_packing.cpp | Adds an aligned 24-bit stereo packing fast path and dispatch logic. |
| src/lpc.cpp | Applies FLAC_NO_SANITIZE_OVERFLOW to LPC restore paths for fuzzing/UBSan scenarios. |
| src/frame_header.cpp | Removes redundant reserved-bit check and documents that it’s validated earlier. |
| src/flac_decoder.cpp | Switches to bit_reader.h primitives, adds decode_rice_partition() to reduce register pressure. |
| src/decorrelation.cpp | Simplifies channel assignment branching for joint-stereo decorrelation. |
| src/crc.cpp | Refactors CRC16 update loop, adds 32-bit unrolled processing and keeps CRC in a 32-bit local. |
| src/compiler.h | Adds FLAC_NOINLINE, FLAC_ASSUME, and FLAC_NO_SANITIZE_OVERFLOW macros. |
| src/bit_reader.h | New header-only bit-reader implementation used by the decoder hot paths. |
| README.md | Updates headline performance table and notes about PSRAM vs internal SRAM impacts. |
| include/micro_flac/flac_decoder.h | Updates private decoder internals API (replaces inline Rice helper with decode_rice_partition). |
| examples/decode_benchmark/README.md | Refreshes benchmark numbers and adds ESP32 internal/PSRAM comparison guidance. |
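For readers unfamiliar with the macros listed for `src/compiler.h`, here is one plausible way such macros could be defined. The definitions below are assumptions for illustration only; the actual header contents are not shown on this page, and `predict_order2` is a hypothetical function used to demonstrate placement.

```cpp
#include <cstdint>

// Hypothetical definitions; the real src/compiler.h may differ.
#if defined(__clang__)
#define FLAC_NOINLINE __attribute__((noinline))
#define FLAC_ASSUME(cond) __builtin_assume(cond)
#define FLAC_NO_SANITIZE_OVERFLOW \
    __attribute__((no_sanitize("signed-integer-overflow")))
#elif defined(__GNUC__)
#define FLAC_NOINLINE __attribute__((noinline))
#define FLAC_ASSUME(cond) \
    do { if (!(cond)) __builtin_unreachable(); } while (0)
#define FLAC_NO_SANITIZE_OVERFLOW __attribute__((no_sanitize_undefined))
#elif defined(_MSC_VER)
#define FLAC_NOINLINE __declspec(noinline)
#define FLAC_ASSUME(cond) __assume(cond)
#define FLAC_NO_SANITIZE_OVERFLOW
#else
#define FLAC_NOINLINE
#define FLAC_ASSUME(cond) ((void)0)
#define FLAC_NO_SANITIZE_OVERFLOW
#endif

// Hypothetical order-2 predictor showing where the macros would go.
// The sum is widened to 64 bits here, so this sketch itself cannot overflow.
FLAC_NOINLINE FLAC_NO_SANITIZE_OVERFLOW
int32_t predict_order2(const int32_t* history, const int32_t* coefs, int shift) {
    FLAC_ASSUME(shift >= 0 && shift < 32);  // lets the optimizer drop range checks
    int64_t sum = static_cast<int64_t>(coefs[0]) * history[0] +
                  static_cast<int64_t>(coefs[1]) * history[1];
    return static_cast<int32_t>(sum >> shift);
}
```

Note that suppressing sanitizer instrumentation only silences the report; it does not make signed overflow defined behavior in C++, which is why the sketch keeps its own arithmetic in a wider type.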
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Optimizes the FLAC decoder. On an S3, performance improves by around 15%, and on a P4 by about 10%.

- Adds an optimized output path for 4-byte-aligned buffers when outputting packed 24-bit samples.
- Simplifies the code in a few spots, such as removing unnecessary or redundant conditional checks and brackets.
- Unrolls the CRC loop by 8 on 32-bit platforms to amortize the loop increment cost. This isn't slicing-by-8 as used on 64-bit platforms; it just uses the one table 8 times per loop iteration.
- Adds several compiler macros: `FLAC_NOINLINE`, `FLAC_ASSUME`, and `FLAC_NO_SANITIZE_OVERFLOW`.
  - `FLAC_NOINLINE` and `FLAC_ASSUME` let us optimize hot loops better.
  - `FLAC_NO_SANITIZE_OVERFLOW` will be useful for a potential fuzzer: it disables UBSan checks on the LPC restoration path, where the overflows they would flag are completely unavoidable.
- Adds a local bit reader helper. This lets the compiler keep the bit buffer in registers rather than constantly loading and storing the member variable.
- Adds a `decode_rice_partition` helper to reduce register pressure, which helps the compiler avoid using the stack for intermediate storage.
- Updates the internal documentation to describe the new bit reader helper header.
- Updates the benchmark numbers based on these optimizations.
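
The local bit reader idea from the list above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's `BitReaderLocal`: the `DecoderState` fields, the `LocalBits` type, and `read_fields` are all hypothetical names. The pattern is to copy the bit-buffer state into a stack-local object at the top of a hot loop, work only on the local copy, and write it back once at the end.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical decoder state; field names are illustrative only.
struct DecoderState {
    const uint8_t* data;
    size_t pos;           // index of the next byte to load
    size_t size;          // total bytes available
    uint64_t bit_buffer;  // unread bits, right-aligned
    uint32_t bit_count;   // number of valid bits in bit_buffer
};

// Stack-local copy of the reader state. Because its address never escapes
// the hot loop, the compiler is free to keep these fields in registers.
struct LocalBits {
    const uint8_t* data;
    size_t pos;
    size_t size;
    uint64_t buf;
    uint32_t count;

    explicit LocalBits(const DecoderState& s)
        : data(s.data), pos(s.pos), size(s.size),
          buf(s.bit_buffer), count(s.bit_count) {}

    // Write the local state back to the decoder once the loop is done.
    void commit(DecoderState& s) const {
        s.pos = pos;
        s.bit_buffer = buf;
        s.bit_count = count;
    }

    // Read n bits (1 <= n <= 32), MSB-first. Returns false when input runs out.
    bool read(uint32_t n, uint32_t& out) {
        while (count < n) {
            if (pos >= size) return false;
            buf = (buf << 8) | data[pos++];
            count += 8;
        }
        count -= n;
        out = static_cast<uint32_t>((buf >> count) & ((uint64_t{1} << n) - 1));
        return true;
    }
};

// Hot loop: all reads go through the local copy; state is committed once.
bool read_fields(DecoderState& state, uint32_t* out, size_t n, uint32_t width) {
    LocalBits bits(state);
    bool ok = true;
    for (size_t i = 0; i < n; ++i) {
        if (!bits.read(width, out[i])) {
            ok = false;
            break;
        }
    }
    bits.commit(state);
    return ok;
}
```

Reading through member variables directly would force a load and store on every access whenever the compiler cannot prove the object is not aliased; the local copy sidesteps that at the cost of one copy in and one copy out per call.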