[Regression] 4.2.4: opensearch output sustained flush failures, ~60% ingestion rate drop vs 4.2.3

## Bug Report

**Describe the bug**

After upgrading from `cr.fluentbit.io/fluent/fluent-bit:4.2.3` to `4.2.4`, the `opensearch` output plugin produces sustained `failed to flush chunk` warnings and `cannot be retried` errors against a healthy OpenSearch 2.19.4 backend. Ingestion rate drops from ~480 docs/min (baseline at 4.2.3) to ~200 docs/min, with `Retry_Limit 3` exhausted and chunks discarded. OpenSearch itself logs no errors and accepts writes from other clients without issue. Rolling back the DaemonSet image to `4.2.3` (no other config change) immediately restores normal behavior: 0 flush errors, ingestion rate recovers to ~561 docs/min.
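For reference, the headline "~60%" can be checked against the rates quoted above. This is only a small arithmetic sketch; the helper name `percent_drop` is made up for illustration, and the inputs are the approximate docs/min figures from this report.

```python
def percent_drop(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a percentage."""
    if before <= 0:
        raise ValueError("baseline rate must be positive")
    return (before - after) / before * 100.0


# Approximate rates quoted in this report, in docs/min.
BASELINE_4_2_3 = 480   # steady state before the upgrade
REGRESSED_4_2_4 = 200  # while running 4.2.4
RECOVERED_4_2_3 = 561  # after rolling back to 4.2.3

print(f"vs pre-upgrade baseline: {percent_drop(BASELINE_4_2_3, REGRESSED_4_2_4):.1f}%")  # 58.3%
print(f"vs post-rollback rate:   {percent_drop(RECOVERED_4_2_3, REGRESSED_4_2_4):.1f}%")  # 64.3%
```

So the drop is roughly 58% against the pre-upgrade baseline and roughly 64% against the post-rollback rate, which brackets the ~60% in the title.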

**To Reproduce**

Steps:
1. Run Fluent Bit `4.2.3` as a DaemonSet on a Kubernetes cluster, tailing `/var/log/containers/*.log` and shipping to an in-cluster OpenSearch via the `opensearch` output. Confirm steady ingestion.
2. Change only the image tag in the DaemonSet to `4.2.4` and apply.
3. After all DaemonSet pods roll to the new image, observe Fluent Bit logs filling with:

```
[engine] failed to flush chunk '1-<timestamp>.<seq>.flb', retry in N seconds: task_id=X, input=tail.0 > output=opensearch.0 (out_id=0)
[engine] chunk '1-<timestamp>.<seq>.flb' cannot be retried: task_id=X, input=tail.0 > output=opensearch.0
```

at a rate of 16+ warnings per 30 seconds (sustained, not a transient burst). Querying the OpenSearch daily index shows the doc-arrival rate dropping by more than half versus the same time of day on 4.2.3.

4. Roll the DaemonSet image back to `4.2.3` (no other change). Errors stop within one rollout and the ingestion rate recovers immediately.
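The warning rate in step 3 can be measured rather than eyeballed. The following is a minimal sketch that buckets the two engine messages into 30-second windows. It assumes each log line starts with Fluent Bit's usual `[YYYY/MM/DD HH:MM:SS]` timestamp prefix (omitted from the excerpt above); the function name `count_flush_events` and the sample lines are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Assumed line shape (the excerpt in this report omits the prefix):
#   [2025/01/01 12:00:03] [ warn] [engine] failed to flush chunk '...', retry in ...
LINE_RE = re.compile(
    r"\[(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})(?:\.\d+)?\]"
    r".*\[engine\] (?P<msg>failed to flush chunk|chunk '[^']*' cannot be retried)"
)
EPOCH = datetime(1970, 1, 1)


def count_flush_events(lines, window_seconds=30):
    """Bucket engine flush events into fixed-size time windows.

    Returns {window_start: Counter}; each Counter counts 'retry'
    (failed to flush, will retry) and 'dropped' (cannot be retried).
    """
    buckets = {}
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # unrelated log line
        ts = datetime.strptime(match["ts"], "%Y/%m/%d %H:%M:%S")
        elapsed = int((ts - EPOCH).total_seconds())
        window_start = EPOCH + timedelta(seconds=elapsed - elapsed % window_seconds)
        kind = "retry" if match["msg"].startswith("failed") else "dropped"
        buckets.setdefault(window_start, Counter())[kind] += 1
    return buckets


# Synthetic sample lines, for illustration only (not captured from the cluster).
sample = [
    "[2025/01/01 12:00:03] [ warn] [engine] failed to flush chunk '1-1.1.flb', "
    "retry in 7 seconds: task_id=0, input=tail.0 > output=opensearch.0 (out_id=0)",
    "[2025/01/01 12:00:21] [error] [engine] chunk '1-1.1.flb' cannot be retried: "
    "task_id=0, input=tail.0 > output=opensearch.0",
    "[2025/01/01 12:00:31] [ warn] [engine] failed to flush chunk '1-2.1.flb', "
    "retry in 9 seconds: task_id=1, input=tail.0 > output=opensearch.0 (out_id=0)",
]
for window, counts in sorted(count_flush_events(sample).items()):
    print(window, dict(counts))
```

In practice the input would be the pod's log stream, for example the saved output of `kubectl logs` for one DaemonSet pod passed in as an iterable of lines.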

**Expected behavior**

The `opensearch` output should ship chunks as efficiently in `4.2.4` as in `4.2.3`, with no persistent flush failures against a healthy backend.

**Your Environment**

* **Version used**: Fluent Bit `4.2.4` (regressed from `4.2.3`)
* **Image**: `cr.fluentbit.io/fluent/fluent-bit:4.2.4` (official, amd64)
* **Configuration (output section)**:
  ```ini
  [OUTPUT]
      Name            opensearch
      Match           kube.*
      Host            opensearch.logging.svc.cluster.local
      Port            9200
      Index           fluent-bit-kube
      Type            _doc
      Suppress_Type_Name On
      Logstash_Format On
      Logstash_Prefix fluent-bit-kube
      Time_Key        @timestamp
      Retry_Limit     3
      tls             Off
  ```
* **Inputs/filters**: `tail` reading `/var/log/containers/*.log` + `kubernetes` filter (standard).
* **Kubernetes**: microk8s 1.35.0 (containerd://2.1.3), 8 nodes (Ubuntu 24.04 LTS).
* **OpenSearch**: `opensearchproject/opensearch:2.19.4`, single-node, plain HTTP on port 9200 (security plugin disabled), healthy throughout the regression. No relevant log entries on the OpenSearch side during the window. The cluster continues to accept writes from non-Fluent-Bit clients without issue.
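The docs/min figures in this report come from querying the daily index. A minimal sketch of such a measurement is below, assuming the default Logstash date format (`%Y.%m.%d`), so that the configuration above writes to `fluent-bit-kube-YYYY.MM.DD`, and using the `_count` API with a range filter on `@timestamp`. The function names are made up for illustration, and a window that crosses midnight UTC would need to query two indices, which this sketch does not handle.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def daily_index(prefix: str, day: datetime) -> str:
    """Index name under Logstash_Format, assuming the default %Y.%m.%d suffix."""
    return f"{prefix}-{day:%Y.%m.%d}"


def count_request(base_url: str, prefix: str, end: datetime, minutes: int = 5):
    """Build a _count request for documents in the `minutes` before `end`."""
    start = end - timedelta(minutes=minutes)
    body = {
        "query": {
            "range": {
                "@timestamp": {"gte": start.isoformat(), "lt": end.isoformat()}
            }
        }
    }
    return urllib.request.Request(
        f"{base_url}/{daily_index(prefix, end)}/_count",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def docs_per_minute(base_url: str, prefix: str, end: datetime, minutes: int = 5) -> float:
    """Average arrival rate over the window; performs a real HTTP call."""
    req = count_request(base_url, prefix, end, minutes)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["count"] / minutes


# Example request for the setup described in this report (no network call here).
req = count_request(
    "http://opensearch.logging.svc.cluster.local:9200",
    "fluent-bit-kube",
    datetime.now(timezone.utc),
)
print(req.get_method(), req.full_url)
```

Comparing `docs_per_minute` for the same window on `4.2.3` and `4.2.4` gives the before/after rates without depending on Fluent Bit's own counters.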

**Additional context**

This is a real, measurable regression on a production homelab cluster, not a transient issue. It reproduced cleanly across all 8 Fluent Bit DaemonSet pods, on different nodes, simultaneously. Rollback to 4.2.3 is the only mitigation we've found.

Suspected scope based on the `v4.2.3...v4.2.4` diff (without bisecting):

- **PR #11519** (engine: use effective post-processor counts for output metrics): modifies `handle_output_event()` in `flb_engine.c` directly, the function that processes every flush result, and adds new task/route fields. The PR description states it is metrics-only, but the changes sit on the hot flush-handling path, so a defect there could explain the observed retry/drop pattern.
- **PR #11693** (lib: monkey: upgrade to v1.8.8): large library bump (286 additions) that may affect HTTP client behavior.

(Note: `4.2.3.1` is a Dockerfile-only release with no runtime code change, so it is not a useful intermediate. As of writing, there is no `4.2.5` and the `4.2.x` branch appears effectively dormant. Recent activity is on the `5.x` line, which we cannot adopt because of the separate `yyjson` parser incompatibility with non-JSON inputs.)

Happy to provide more telemetry, run a sandbox bisect if there's interest, or test a proposed fix.


[Regression] 4.2.4: opensearch output sustained flush failures, ~60% ingestion rate drop vs 4.2.3 #11799
