diff --git a/README.md b/README.md
index a9884dfa445..4c71e188a90 100644
--- a/README.md
+++ b/README.md
@@ -78,6 +78,7 @@ The following resources may also be useful:
  * [Benchmarks](tensorflow/lite/micro/benchmarks/README.md)
  * [Profiling](tensorflow/lite/micro/docs/profiling.md)
  * [Memory Management](tensorflow/lite/micro/docs/memory_management.md)
+ * [Error Handling](tensorflow/lite/micro/docs/error_handling_guide.md)
  * [Logging](tensorflow/lite/micro/docs/logging.md)
  * [Porting Reference Kernels from TfLite to TFLM](tensorflow/lite/micro/docs/porting_reference_ops.md)
  * [Optimized Kernel Implementations](tensorflow/lite/micro/docs/optimized_kernel_implementations.md)
diff --git a/tensorflow/lite/micro/docs/error_handling_guide.md b/tensorflow/lite/micro/docs/error_handling_guide.md
new file mode 100644
index 00000000000..1f48071ce99
--- /dev/null
+++ b/tensorflow/lite/micro/docs/error_handling_guide.md
@@ -0,0 +1,240 @@
+# Error Handling & Defensive Programming Guide
+
+Balancing defensiveness with the extreme constraints of microcontrollers (where
+every byte of Flash/RAM and every clock cycle matters) is a core design tension
+in TFLM. Since TFLM targets environments that often lack memory protection (no
+MMU), don't support C++ exceptions, and run on bare metal or simple RTOSs, a
+"standard" defensive posture would result in unacceptable binary bloat and
+performance degradation.
+
+This guide provides an architectural framework for writing safe kernels and
+offers recommendations for code reviews for contributors.
+
+--------------------------------------------------------------------------------
+
+## 1. Trust Boundaries: Where Data Comes From
+
+To achieve high performance without sacrificing critical stability, TFLM
+distinguishes between trusted and untrusted data sources.
+
+### The C++ API (Trusted)
+
+When a firmware developer uses the public TFLM C++ API (e.g., instantiating a
+`MicroInterpreter` or calling `Invoke()`), we treat the developer as trusted.
+
+*   **Recommendation:** The core API should not spend runtime cycles or binary
+    space checking for invalid parameters (e.g., passing `nullptr`) or incorrect
+    state-machine sequences.
+*   **Mechanism:** Rely entirely on `TFLITE_DCHECK`. The firmware engineer gets
+    an immediate assertion failure during local debugging if they misuse the
+    API. In release builds, these compile out, leaving zero overhead.
+
+### The FlatBuffer Model (Partially Trusted)
+
+When dealing with the `.tflite` FlatBuffer model, we distinguish between two
+types of malformation:
+
+*   **Corrupted FlatBuffer Files (Structural Malformation):** The core
+    `MicroInterpreter` assumes the FlatBuffer is structurally valid. We *do not*
+    perform standard bounds checking on FlatBuffer schema offsets or operator
+    tensor indices (e.g., if an operator requests tensor index `999` but the
+    model only has `10` tensors, a production TFLM build will read out of
+    bounds). To aid local debugging, structural checks are considered a
+    "good-to-have" feature but **should exclusively use `TFLITE_DCHECK`**. If
+    models are provided via an untrusted channel (e.g., OTA), the application
+    layer is entirely responsible for validating the integrity of the model
+    before passing it to TFLM.
+*   **Invalid Model Topologies (Semantic Malformation):** The runtime *should*
+    defend against TFLite Converter bugs or unsupported configurations (e.g.,
+    wrong tensor shapes, unsupported types). This validation happens entirely in
+    the Setup phase. Because developers can use `TF_LITE_STRIP_ERROR_STRINGS` to
+    remove the string bloat in production, checking topologies in `Prepare`
+    provides safe fallback (e.g., rejecting a bad OTA update) without the severe
+    ROM penalty of embedded strings.
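
For the structural case above, where the application layer owns model integrity, one possible gate is a checksum comparison before the buffer is handed to TFLM. The sketch below is only an illustration: `Crc32` and `ModelImageLooksIntact` are hypothetical names, not TFLM APIs, and it assumes the expected CRC arrives alongside the OTA image. A CRC detects accidental corruption only; a channel that may be tampered with would need a cryptographic signature instead.

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320, the zlib/Ethernet variant).
// Table-free on purpose: slower than a lookup table, but costs almost no ROM.
inline uint32_t Crc32(const uint8_t* data, size_t length) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < length; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      // mask is all ones when the low bit is set, otherwise zero.
      const uint32_t mask = 0u - (crc & 1u);
      crc = (crc >> 1) ^ (0xEDB88320u & mask);
    }
  }
  return ~crc;
}

// Hypothetical application-layer check run before the model reaches TFLM.
// expected_crc is assumed to be delivered with the OTA image.
inline bool ModelImageLooksIntact(const uint8_t* model, size_t length,
                                  uint32_t expected_crc) {
  if (model == nullptr || length == 0) return false;
  return Crc32(model, length) == expected_crc;
}
```

The application would call `ModelImageLooksIntact` once per update and refuse to construct the interpreter on failure, so the core runtime keeps its zero-cost assumption of a structurally valid FlatBuffer.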
+
+--------------------------------------------------------------------------------
+
+## 2. Execution Phases: When Code Runs
+
+TFLM distinguishes between setup (which runs once) and execution (which runs
+continuously).
+
+### Phase 1: Setup & Initialization (`Prepare`)
+
+During the `Prepare` phase, kernels should validate their inputs and parameters
+to ensure `Eval` can run blindly and safely. We prioritize clear error messages
+here (relying on `TF_LITE_STRIP_ERROR_STRINGS` to mitigate the ROM cost in
+production) and we can afford to spend CPU cycles on validation.
+
+**What to Validate (Use `TF_LITE_ENSURE`):**
+
+*   **Model-provided parameters:** Check the number of inputs/outputs, tensor
+    types, and tensor shapes.
+*   **Quantization parameters:** Explicitly check for invalid quantization
+    parameters (e.g., a `scale` of `0.0` or a `zero_point` out of bounds) that
+    could cause a divide-by-zero or overflow later in `Eval`.
+*   **Resource allocations:** Check if memory allocations (like
+    `AllocateTempInputTensor`) return `nullptr`.
+
+### Phase 2: Execution (`Eval`)
+
+During the `Eval` phase, the runtime is vulnerable to data that causes hardware
+traps or memory corruption. We should avoid spending cycles or ROM on redundant
+checks.
+
+**What to Defend Against (Recommended Validation):**
+
+Kernels should actively defend against three specific runtime threats during
+`Eval`:
+
+1.  **Out-of-bounds Memory Access:** If an input tensor contains indices or
+    offsets generated at runtime (e.g., in `GATHER`, `STRIDED_SLICE`), these
+    **should** be bounds-checked at runtime using a raw `if (!valid) return
+    kTfLiteError;`.
+2.  **Hardware Faults (Divide by Zero):** Kernels should mathematically protect
+    against divide-by-zero (e.g., if a divisor comes from an input tensor) or
+    explicitly check and return an error.
+3.  **Infinite Loops:** All loops inside `Eval` *should* have a guaranteed
+    maximum bound to prevent data-dependent infinite hangs.
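
The three `Eval` defenses listed above can be sketched in a standalone form. This is not code from any TFLM kernel: `SketchStatus`, `GatherSketch`, and `DivideSketch` are hypothetical stand-ins (the enum mirrors `kTfLiteOk`/`kTfLiteError`, and `assert` plays the role of `TFLITE_DCHECK`) so the example compiles without the TFLM headers.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-in for TfLiteStatus so the sketch is self-contained.
enum SketchStatus { kSketchOk = 0, kSketchError = 1 };

// Hypothetical gather-style loop: output[i] = params[indices[i]].
inline SketchStatus GatherSketch(const float* params, int32_t params_size,
                                 const int32_t* indices, int32_t indices_size,
                                 float* output) {
  // Invariants that Prepare would guarantee: debug-only, no release cost.
  assert(params != nullptr && indices != nullptr && output != nullptr);
  // The loop bound is a tensor size, never a value read from tensor data.
  for (int32_t i = 0; i < indices_size; ++i) {
    const int32_t index = indices[i];
    // Control data from an input tensor: raw check, no string literal.
    if (index < 0 || index >= params_size) return kSketchError;
    output[i] = params[index];
  }
  return kSketchOk;
}

// Hypothetical element-wise integer division with a runtime divisor tensor.
inline SketchStatus DivideSketch(const int32_t* numerators,
                                 const int32_t* divisors, int32_t size,
                                 int32_t* output) {
  assert(numerators != nullptr && divisors != nullptr && output != nullptr);
  for (int32_t i = 0; i < size; ++i) {
    // Reject a zero divisor, and INT32_MIN / -1, which overflows and is
    // undefined behavior in C++.
    if (divisors[i] == 0) return kSketchError;
    if (numerators[i] == INT32_MIN && divisors[i] == -1) return kSketchError;
    output[i] = numerators[i] / divisors[i];
  }
  return kSketchOk;
}
```

Both helpers fail with a bare status code, which keeps the release cost to a compare and a branch per element.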
+
+**What to Ignore (GIGO for Signal Data):**
+
+*   TFLM kernels are **strongly discouraged** from explicitly scanning general
+    input signal data (like images or audio) for `NaN`/`Inf`. For standard math
+    operations, garbage data simply results in garbage output (GIGO).
+
+--------------------------------------------------------------------------------
+
+## 3. Macro Selection Guide
+
+To strike the right balance between debuggability, code size (ROM), and
+performance, follow this decision tree.
+
+### The Decision Tree
+
+1.  **Are you writing a Unit Test?**
+    *   Use GoogleTest-style macros like `EXPECT_EQ`, `EXPECT_NEAR`, etc. for
+        verifying test conditions.
+2.  **Are you in `Init` or `Prepare`?**
+    *   Use `TF_LITE_ENSURE`. We want to fail early with a clear log message if
+        the model is incompatible.
+3.  **Are you in `Eval` (or a helper called by `Eval`)?**
+    *   *Is it an invariant guaranteed by `Prepare`?* (e.g., "this pointer
+        cannot be null"). Use `TFLITE_DCHECK`. It costs nothing in production.
+    *   *Is it validating control data from an input tensor?* (e.g., a dynamic
+        index). Use a raw `if (!condition) return kTfLiteError;`. This safely
+        prevents memory corruption without paying the ROM cost of a string
+        literal.
+    *   *Are you propagating an error from a helper function?* Use
+        `TF_LITE_ENSURE_STATUS` or `TF_LITE_ENSURE_OK`.
+
+### Macro Cheat Sheet
+
+Macro | Cost in Release | When to Use
+:--------------------------------------------- | :------------------------------------------------------- | :----------
+`TFLITE_DCHECK`<br>`TFLITE_DCHECK_EQ` | **Zero** (Optimized out) | **Default for `Eval` invariants and FlatBuffer structural bounds checking.**
+`if (!cond) return kTfLiteError;` | **Low** (Branch only) | **Default for validating control data in `Eval`.**
+`TF_LITE_ENSURE_OK`<br>`TF_LITE_ENSURE_STATUS` | **Low** (Branch only) | **Default for propagating errors from helper functions.**
+`TF_LITE_ENSURE`<br>`TF_LITE_ENSURE_EQ` | **High** (Branch + logs `__FILE__`, `__LINE__`, `#cond`) | **Default for `Prepare` / Setup.** Can be used in `Eval` *only* if the failure indicates an unrecoverable state-machine corruption; avoid for normal signal data out-of-bounds.
+`TF_LITE_ENSURE_MSG` | **Highest** (Branch + custom string) | **Use Sparingly.** Only when the default `TF_LITE_ENSURE` error is too cryptic.
+`TFLITE_CHECK` | **Fatal** (Calls `Abort()`) | **Avoid in core TFLM.** Halts the microcontroller.
+
+> **The Hidden Cost of `TF_LITE_ENSURE`:** This macro expands to an `if` branch
+> that calls `TF_LITE_KERNEL_LOG`. By default, this embeds `__FILE__`,
+> `__LINE__`, and the stringified condition directly into the `.rodata` section
+> of the binary. Every single invocation permanently consumes precious ROM. If
+> you have multiple preconditions, combine them into a single
+> `TF_LITE_ENSURE(context, a != nullptr && b != nullptr)`. **Note:** For
+> production builds where ROM is severely constrained, firmware developers
+> should define the `TF_LITE_STRIP_ERROR_STRINGS` macro to compile out these
+> strings, reducing `TF_LITE_ENSURE` to a simple low-cost branch.
+
+--------------------------------------------------------------------------------
+
+## 4. Code Review Guidelines & Concrete Examples
+
+How to address common scenarios in PRs:
+
+### Validating Framework Pointers
+
+**Discouraged.** Please avoid wasting ROM checking the validity of the TFLM
+framework itself. The `context` and `node` pointers passed by the runtime are
+typically guaranteed to be valid. Notably, `TF_LITE_ENSURE(context, context !=
+nullptr)` is logically flawed: if `context` is actually `nullptr`, the macro
+will dereference it to log the error, causing an immediate hardware fault.
+
+### Validating Public API Parameters
+
+**Discouraged.** This penalizes production deployments for bugs that the
+firmware developer should have caught during local testing.
+We recommend asking
+the contributor to use `TFLITE_DCHECK(op_resolver != nullptr);` instead of
+adding `if (op_resolver == nullptr) return kTfLiteError;` to the
+`MicroInterpreter` API.
+
+### Preventing Null Pointer Dereferences
+
+*   **If in `Prepare` (Recommended):** It is standard practice to check for
+    nulls when pulling tensors from the context.
+
+```cpp
+TfLiteTensor* operand = micro_context->AllocateTempInputTensor(node, kOperandTensor);
+TF_LITE_ENSURE(context, operand != nullptr);
+```
+
+*   **If in `Eval` (Discouraged):** Production builds shouldn't pay the cycle
+    cost for checks that can't fail at runtime if `Prepare` succeeded. Ask the
+    author to change it to `TFLITE_DCHECK`.
+
+```cpp
+const TfLiteEvalTensor* input_id = tflite::micro::GetEvalInput(context, node, 0);
+TFLITE_DCHECK(input_id != nullptr);
+```
+
+### Preventing Buffer Overflows
+
+*   **If in `Prepare` (Recommended):** Use `TF_LITE_ENSURE` to validate that
+    tensor sizes are compatible.
+*   **If in `Eval`:** If the overflow comes from invalid *control data* (like an
+    input tensor providing indices), it *should* be validated in `Eval` to
+    prevent memory corruption. Use raw returns:
+
+```cpp
+// e.g., third_party/tflite_micro/tensorflow/lite/micro/kernels/gather_nd.cc
+// Note: use subtraction to prevent integer overflow!
+if (from_pos < 0 || from_pos > params_flat_size - slice_size) {
+  return kTfLiteError;  // Halts execution to prevent out-of-bounds memory read
+}
+```
+
+If it's just an invariant guaranteed by `Prepare`, ask to hoist the check or
+use `TFLITE_DCHECK`.
+
+### Fixing Fuzzer Crashes
+
+Fuzzer fixes should align with the core philosophy. We evaluate them based on
+the type of crash:
+
+1.  **Corrupted FlatBuffer Files (Working As Intended):** If the fuzzer mutates
+    the FlatBuffer byte array so a `Tensor` offset points beyond the end of the
+    file (causing a segfault), **request changes if the PR uses `TF_LITE_ENSURE`
+    or raw `if` statements**.
+    To keep the binary size as small as possible, we
+    prefer to omit the FlatBuffer verifier. However, **it is recommended to
+    accept the PR if it exclusively uses `TFLITE_DCHECK`** to catch the
+    out-of-bounds index. This treats the structural check purely as a zero-cost
+    developer aid during debugging.
+2.  **Invalid Model Topologies (Fix in Prepare):** If the fuzzer generates a
+    valid FlatBuffer but modifies a `CONV_2D` operator to have 0 inputs instead
+    of the expected 3, **this is a good fix in `Prepare`** using
+    `TF_LITE_ENSURE_EQ(context, NumInputs(node), 3);`.
+3.  **Static Math Crashes (Fix in Prepare):** If the fuzzer sets a quantization
+    `scale` parameter to `0.0` (causing a hardware trap during inference),
+    **this is a good fix in `Prepare`** using `TF_LITE_ENSURE(context, scale !=
+    0.0);`.
+4.  **Dynamic Math/Data Crashes (Fix in Eval):** If the fuzzer provides input
+    data to a `GATHER` op with an index of `999` (causing out-of-bounds
+    corruption), or a divisor tensor evaluates to `0` (causing a divide-by-zero
+    trap), **this is a good fix in `Eval`**. Prefer a raw `if (index < 0 ||
+    index >= 10) return kTfLiteError;` over `TF_LITE_ENSURE` here: logging the
+    error is not worth the ROM.
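
Fuzzer cases 2 and 3 above both come down to rejecting static, model-provided values once during setup. A rough standalone sketch follows; `NodeSketch`, `PrepareSketch`, and the status enum are hypothetical simplifications, and a real kernel would express the same checks with `TF_LITE_ENSURE_EQ` and `TF_LITE_ENSURE` against the actual node and tensor structures. The scale test is deliberately stricter than the `scale != 0.0` check in the text, and the int8 zero-point range is an assumption made for this example.

```cpp
#include <cstdint>

// Local stand-in for TfLiteStatus so the sketch is self-contained.
enum PrepareSketchStatus { kPrepareSketchOk = 0, kPrepareSketchError = 1 };

// Hypothetical, heavily simplified view of what a kernel inspects in Prepare.
struct NodeSketch {
  int num_inputs;
  float input_scale;
  int32_t input_zero_point;
};

// Validates static model parameters once so Eval never has to re-check them.
inline PrepareSketchStatus PrepareSketch(const NodeSketch& node,
                                         int expected_inputs) {
  // Invalid topology: wrong operand count (fuzzer case 2).
  if (node.num_inputs != expected_inputs) return kPrepareSketchError;
  // Static math hazard: zero scale (fuzzer case 3). Written as !(x > 0) so
  // that negative and NaN scales are rejected as well (an assumption here).
  if (!(node.input_scale > 0.0f)) return kPrepareSketchError;
  // Zero point must fit the quantized type; int8 is assumed for this sketch.
  if (node.input_zero_point < -128 || node.input_zero_point > 127) {
    return kPrepareSketchError;
  }
  return kPrepareSketchOk;
}
```

Once these checks pass during setup, the matching `Eval` code can treat the operand count and quantization parameters as invariants and cover them with `TFLITE_DCHECK` only.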