-
Notifications
You must be signed in to change notification settings - Fork 1k
Added Error Handling & Defensive Programming Guide #3539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
veblush
wants to merge
1
commit into
tensorflow:main
Choose a base branch
from
veblush:error-doc
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+241
−0
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,240 @@ | ||
| # Error Handling & Defensive Programming Guide | ||
|
|
||
| Balancing defensiveness with the extreme constraints of microcontrollers (where | ||
| every byte of Flash/RAM and every clock cycle matters) is a core design tension | ||
| in TFLM. Since TFLM targets environments that often lack memory protection (no | ||
| MMU), don't support C++ exceptions, and run on bare metal or simple RTOSs, a | ||
| "standard" defensive posture would result in unacceptable binary bloat and | ||
| performance degradation. | ||
|
|
||
| This guide provides an architectural framework for writing safe kernels and | ||
| offers recommendations for code reviews for contributors. | ||
|
|
||
| -------------------------------------------------------------------------------- | ||
|
|
||
| ## 1. Trust Boundaries: Where Data Comes From | ||
|
|
||
| To achieve high performance without sacrificing critical stability, TFLM | ||
| distinguishes between trusted and untrusted data sources. | ||
|
|
||
| ### The C++ API (Trusted) | ||
|
|
||
| When a firmware developer uses the public TFLM C++ API (e.g., instantiating a | ||
| `MicroInterpreter` or calling `Invoke()`), we treat the developer as trusted. | ||
|
|
||
| * **Recommendation:** The core API should not spend runtime cycles or binary | ||
| space checking for invalid parameters (e.g., passing `nullptr`) or incorrect | ||
| state-machine sequences. | ||
| * **Mechanism:** Rely entirely on `TFLITE_DCHECK`. The firmware engineer gets | ||
| an immediate assertion failure during local debugging if they misuse the | ||
| API. In release builds, these compile out, leaving zero overhead. | ||
|
|
||
| ### The FlatBuffer Model (Partially Trusted) | ||
|
|
||
| When dealing with the `.tflite` FlatBuffer model, we distinguish between two | ||
| types of malformation: | ||
|
|
||
| * **Corrupted FlatBuffer Files (Structural Malformation):** The core | ||
| `MicroInterpreter` assumes the FlatBuffer is structurally valid. We *do not* | ||
| perform standard bounds checking on FlatBuffer schema offsets or operator | ||
| tensor indices (e.g., if an operator requests tensor index `999` but the | ||
| model only has `10` tensors, a production TFLM build will read out of | ||
| bounds). To aid local debugging, structural checks are considered a | ||
| "good-to-have" feature but **should exclusively use `TFLITE_DCHECK`**. If | ||
| models are provided via an untrusted channel (e.g., OTA), the application | ||
| layer is entirely responsible for validating the integrity of the model | ||
| before passing it to TFLM. | ||
| * **Invalid Model Topologies (Semantic Malformation):** The runtime *should* | ||
| defend against TFLite Converter bugs or unsupported configurations (e.g., | ||
| wrong tensor shapes, unsupported types). This validation happens entirely in | ||
| the Setup phase. Because developers can use `TF_LITE_STRIP_ERROR_STRINGS` to | ||
| remove the string bloat in production, checking topologies in `Prepare` | ||
| provides safe fallback (e.g., rejecting a bad OTA update) without the severe | ||
| ROM penalty of embedded strings. | ||
|
|
||
| -------------------------------------------------------------------------------- | ||
|
|
||
| ## 2. Execution Phases: When Code Runs | ||
|
|
||
| TFLM distinguishes between setup (which runs once) and execution (which runs | ||
| continuously). | ||
|
|
||
| ### Phase 1: Setup & Initialization (`Prepare`) | ||
|
|
||
| During the `Prepare` phase, kernels should validate their inputs and parameters | ||
| to ensure `Eval` can run blindly and safely. We prioritize clear error messages | ||
| here (relying on `TF_LITE_STRIP_ERROR_STRINGS` to mitigate the ROM cost in | ||
| production) and we can afford to spend CPU cycles on validation. | ||
|
|
||
| **What to Validate (Use `TF_LITE_ENSURE`):** | ||
|
|
||
| * **Model-provided parameters:** Check the number of inputs/outputs, tensor | ||
| types, and tensor shapes. | ||
| * **Quantization parameters:** Explicitly check for invalid quantization | ||
| parameters (e.g., a `scale` of `0.0` or a `zero_point` out of bounds) that | ||
| could cause a divide-by-zero or overflow later in `Eval`. | ||
| * **Resource allocations:** Check if memory allocations (like | ||
| `AllocateTempInputTensor`) return `nullptr`. | ||
|
|
||
| ### Phase 2: Execution (`Eval`) | ||
|
|
||
| During the `Eval` phase, the runtime is vulnerable to data that causes hardware | ||
| traps or memory corruption. We should avoid spending cycles or ROM on redundant | ||
| checks. | ||
|
|
||
| **What to Defend Against (Recommended Validation):** | ||
|
|
||
| Kernels should actively defend against three specific runtime threats during | ||
| `Eval`: | ||
|
|
||
| 1. **Out-of-bounds Memory Access:** If an input tensor contains indices or | ||
| offsets generated at runtime (e.g., in `GATHER`, `STRIDED_SLICE`), these | ||
| **should** be bounds-checked at runtime using a raw `if (!valid) return | ||
| kTfLiteError;`. | ||
| 2. **Hardware Faults (Divide by Zero):** Kernels should mathematically protect | ||
| against divide-by-zero (e.g., if a divisor comes from an input tensor) or | ||
| explicitly check and return an error. | ||
| 3. **Infinite Loops:** All loops inside `Eval` *should* have a guaranteed | ||
| maximum bound to prevent data-dependent infinite hangs. | ||
|
|
||
| **What to Ignore (GIGO for Signal Data):** | ||
|
|
||
| * TFLM kernels are **strongly discouraged** from explicitly scanning general | ||
| input signal data (like images or audio) for `NaN`/`Inf`. For standard math | ||
| operations, garbage data simply results in garbage output (GIGO). | ||
|
|
||
| -------------------------------------------------------------------------------- | ||
|
|
||
| ## 3. Macro Selection Guide | ||
|
|
||
| To strike the right balance between debuggability, code size (ROM), and | ||
| performance, follow this decision tree. | ||
|
|
||
| ### The Decision Tree | ||
|
|
||
| 1. **Are you writing a Unit Test?** | ||
| * Use GoogleTest-style macros like `EXPECT_EQ`, `EXPECT_NEAR`, etc. for | ||
| verifying test conditions. | ||
| 2. **Are you in `Init` or `Prepare`?** | ||
| * Use `TF_LITE_ENSURE`. We want to fail early with a clear log message if | ||
| the model is incompatible. | ||
| 3. **Are you in `Eval` (or a helper called by `Eval`)?** | ||
| * *Is it an invariant guaranteed by `Prepare`?* (e.g., "this pointer | ||
| cannot be null"). Use `TFLITE_DCHECK`. It costs nothing in production. | ||
| * *Is it validating control data from an input tensor?* (e.g., a dynamic | ||
| index). Use a raw `if (!condition) return kTfLiteError;`. This safely | ||
| prevents memory corruption without paying the ROM cost of a string | ||
| literal. | ||
| * *Are you propagating an error from a helper function?* Use | ||
| `TF_LITE_ENSURE_STATUS` or `TF_LITE_ENSURE_OK`. | ||
|
|
||
| ### Macro Cheat Sheet | ||
|
|
||
| Macro | Cost in Release | When to Use | ||
| :--------------------------------------------- | :------------------------------------------------------- | :---------- | ||
| `TFLITE_DCHECK`<br>`TFLITE_DCHECK_EQ` | **Zero** (Optimized out) | **Default for `Eval` invariants and FlatBuffer structural bounds checking.** | ||
| `if (!cond) return kTfLiteError;` | **Low** (Branch only) | **Default for validating control data in `Eval`.** | ||
| `TF_LITE_ENSURE_OK`<br>`TF_LITE_ENSURE_STATUS` | **Low** (Branch only) | **Default for propagating errors from helper functions.** | ||
| `TF_LITE_ENSURE`<br>`TF_LITE_ENSURE_EQ` | **High** (Branch + logs `__FILE__`, `__LINE__`, `#cond`) | **Default for `Prepare` / Setup.** Can be used in `Eval` *only* if the failure indicates an unrecoverable state-machine corruption; avoid for normal signal data out-of-bounds. | ||
| `TF_LITE_ENSURE_MSG` | **Highest** (Branch + custom string) | **Use Sparingly.** Only when the default `TF_LITE_ENSURE` error is too cryptic. | ||
| `TFLITE_CHECK` | **Fatal** (Calls `Abort()`) | **Avoid in core TFLM.** Halts the microcontroller. | ||
|
|
||
| > **The Hidden Cost of `TF_LITE_ENSURE`:** This macro expands to an `if` branch | ||
| > that calls `TF_LITE_KERNEL_LOG`. By default, this embeds `__FILE__`, | ||
| > `__LINE__`, and the stringified condition directly into the `.rodata` section | ||
| > of the binary. Every single invocation permanently consumes precious ROM. If | ||
| > you have multiple preconditions, combine them into a single | ||
| > `TF_LITE_ENSURE(context, a != nullptr && b != nullptr)`. **Note:** For | ||
| > production builds where ROM is severely constrained, firmware developers | ||
| > should define the `TF_LITE_STRIP_ERROR_STRINGS` macro to compile out these | ||
| > strings, reducing `TF_LITE_ENSURE` to a simple low-cost branch. | ||
|
|
||
| -------------------------------------------------------------------------------- | ||
|
|
||
| ## 4. Code Review Guidelines & Concrete Examples | ||
|
|
||
| How to address common scenarios in PRs: | ||
|
|
||
| ### Validating Framework Pointers | ||
|
|
||
| **Discouraged.** Please avoid wasting ROM checking the validity of the TFLM | ||
| framework itself. The `context` and `node` pointers passed by the runtime are | ||
| typically guaranteed to be valid. Notably, `TF_LITE_ENSURE(context, context != | ||
| nullptr)` is logically flawed: if `context` is actually `nullptr`, the macro | ||
| will dereference it to log the error, causing an immediate hardware fault. | ||
|
|
||
| ### Validating Public API Parameters | ||
|
|
||
| **Discouraged.** This penalizes production deployments for bugs that the | ||
| firmware developer should have caught during local testing. We recommend asking | ||
| the contributor to use `TFLITE_DCHECK(op_resolver != nullptr);` instead of | ||
| adding `if (op_resolver == nullptr) return kTfLiteError;` to the | ||
| `MicroInterpreter` API. | ||
|
|
||
| ### Preventing Null Pointer Dereferences | ||
|
|
||
| * **If in `Prepare` (Recommended):** It is standard practice to check for | ||
| nulls when pulling tensors from the context. | ||
|
|
||
| ```cpp | ||
| TfLiteTensor* operand = micro_context->AllocateTempInputTensor(node, kOperandTensor); | ||
| TF_LITE_ENSURE(context, operand != nullptr); | ||
| ``` | ||
|
|
||
| * **If in `Eval` (Discouraged):** Production builds shouldn't pay the cycle | ||
| cost for checks that can't fail at runtime if `Prepare` succeeded. Ask the | ||
| author to change it to `TFLITE_DCHECK`. | ||
|
|
||
| ```cpp | ||
| const TfLiteEvalTensor* input_id = tflite::micro::GetEvalInput(context, node, 0); | ||
| TFLITE_DCHECK(input_id != nullptr); | ||
| ``` | ||
|
|
||
| ### Preventing Buffer Overflows | ||
|
|
||
| * **If in `Prepare` (Recommended):** Use `TF_LITE_ENSURE` to validate that | ||
| tensor sizes are compatible. | ||
| * **If in `Eval`:** If the overflow comes from invalid *control data* (like an | ||
| input tensor providing indices), it *should* be validated in `Eval` to | ||
| prevent memory corruption. Use raw returns: | ||
|
|
||
| ```cpp | ||
| // e.g., third_party/tflite_micro/tensorflow/lite/micro/kernels/gather_nd.cc | ||
| // Note: use subtraction to prevent integer overflow! | ||
| if (from_pos < 0 || from_pos > params_flat_size - slice_size) { | ||
| return kTfLiteError; // Halts execution to prevent out-of-bounds memory read | ||
| } | ||
| ``` | ||
|
|
||
| ``` | ||
| If it's just an invariant guaranteed by `Prepare`, ask to hoist the check or | ||
| use `TFLITE_DCHECK`. | ||
| ``` | ||
|
|
||
| ### Fixing Fuzzer Crashes | ||
|
|
||
| Fuzzer fixes should align with the core philosophy. We evaluate them based on | ||
| the type of crash: | ||
|
|
||
| 1. **Corrupted FlatBuffer Files (Working As Intended):** If the fuzzer mutates | ||
| the FlatBuffer byte array so a `Tensor` offset points beyond the end of the | ||
| file (causing a segfault), **request changes if the PR uses `TF_LITE_ENSURE` | ||
| or raw `if` statements**. To keep the binary size as small as possible, we | ||
| prefer to omit the FlatBuffer verifier. However, **it is recommended to | ||
| accept the PR if it exclusively uses `TFLITE_DCHECK`** to catch the | ||
| out-of-bounds index. This treats the structural check purely as a zero-cost | ||
| developer aid during debugging. | ||
| 2. **Invalid Model Topologies (Fix in Prepare):** If the fuzzer generates a | ||
| valid FlatBuffer but modifies a `CONV_2D` operator to have 0 inputs instead | ||
| of the expected 3, **this is a good fix in `Prepare`** using | ||
| `TF_LITE_ENSURE_EQ(context, NumInputs(node), 3);`. | ||
| 3. **Static Math Crashes (Fix in Prepare):** If the fuzzer sets a quantization | ||
| `scale` parameter to `0.0` (causing a hardware trap during inference), | ||
| **this is a good fix in `Prepare`** using `TF_LITE_ENSURE(context, scale != | ||
| 0.0);`. | ||
| 4. **Dynamic Math/Data Crashes (Fix in Eval):** If the fuzzer provides input | ||
| data to a `GATHER` op with an index of `999` (causing out-of-bounds | ||
| corruption), or a divisor tensor evaluates to `0` (causing a divide-by-zero | ||
| trap), **this is a good fix in `Eval`**. Prefer a raw `if (index >= 10) | ||
| return kTfLiteError;` to save ROM (avoid using `TF_LITE_ENSURE` here just to | ||
| log the error; save the ROM). |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👌 The opening paragraph 👍