[Regression] 4.2.4: opensearch output sustained flush failures, ~60% ingestion rate drop vs 4.2.3 #11799

@jmservices

Bug Report

Describe the bug

After upgrading from cr.fluentbit.io/fluent/fluent-bit:4.2.3 to 4.2.4, the opensearch output plugin produces sustained "failed to flush chunk" warnings and "cannot be retried" errors against a healthy OpenSearch 2.19.4 backend. The ingestion rate drops from ~480 docs/min (the 4.2.3 baseline) to ~200 docs/min, with Retry_Limit 3 exhausted and chunks discarded. OpenSearch itself logs no errors and accepts writes from other clients without issue. Rolling the DaemonSet image back to 4.2.3 (no other config change) immediately restores normal behavior: 0 flush errors, and the ingestion rate recovers to ~561 docs/min.

To Reproduce

Steps:

  1. Run Fluent Bit 4.2.3 as a DaemonSet on a Kubernetes cluster, tailing /var/log/containers/*.log and shipping to an in-cluster OpenSearch via the opensearch output. Confirm steady ingestion.
  2. Change only the image tag in the DaemonSet to 4.2.4 and apply.
  3. After all DaemonSet pods roll to the new image, observe Fluent Bit logs filling with:

     [engine] failed to flush chunk '1-<timestamp>.<seq>.flb', retry in N seconds: task_id=X, input=tail.0 > output=opensearch.0 (out_id=0)
     [engine] chunk '1-<timestamp>.<seq>.flb' cannot be retried: task_id=X, input=tail.0 > output=opensearch.0

     at a rate of 16+ warnings per 30 seconds (sustained, not a transient burst). Querying the OpenSearch daily index shows the doc-arrival rate roughly halved versus the same time of day at 4.2.3.
  4. Roll the DaemonSet image back to 4.2.3 (no other change). Errors stop within one rollout and the ingestion rate recovers immediately.
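
For anyone trying to reproduce the warning rate, a minimal Python sketch that tallies the two engine messages from a captured log window. The regular expressions are derived only from the two lines quoted in the steps above; the function name and the returned keys are hypothetical, and no assumption is made about the timestamp prefix Fluent Bit puts in front of each line.

```python
import re

# Patterns derived from the two engine messages quoted in the steps above.
RETRY_RE = re.compile(r"failed to flush chunk '([^']+)', retry in (\d+) seconds")
DROPPED_RE = re.compile(r"chunk '([^']+)' cannot be retried")


def summarize_flush_failures(lines):
    """Count retry-scheduled and discarded-chunk messages in Fluent Bit log lines."""
    retry_scheduled = 0
    cannot_be_retried = 0
    dropped_chunks = set()
    for line in lines:
        if RETRY_RE.search(line):
            retry_scheduled += 1
            continue
        match = DROPPED_RE.search(line)
        if match:
            cannot_be_retried += 1
            dropped_chunks.add(match.group(1))
    return {
        "retry_scheduled": retry_scheduled,
        "cannot_be_retried": cannot_be_retried,
        "dropped_chunks": sorted(dropped_chunks),
    }
```

One way to use it is to capture a fixed window from one pod (for example with `kubectl logs <pod> --since=30s`, where the pod name is a placeholder), pass the lines to `summarize_flush_failures`, and compare the totals between 4.2.3 and 4.2.4.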

Expected behavior

The opensearch output should ship chunks as efficiently in 4.2.4 as in 4.2.3, with no persistent flush failures against a healthy backend.

Your Environment

  • Version used: Fluent Bit 4.2.4 (regressed from 4.2.3)
  • Image: cr.fluentbit.io/fluent/fluent-bit:4.2.4 (official, amd64)
  • Configuration (output section):
    [OUTPUT]
        Name            opensearch
        Match           kube.*
        Host            opensearch.logging.svc.cluster.local
        Port            9200
        Index           fluent-bit-kube
        Type            _doc
        Suppress_Type_Name On
        Logstash_Format On
        Logstash_Prefix fluent-bit-kube
        Time_Key        @timestamp
        Retry_Limit     3
        tls             Off
  • Inputs/filters: tail reading /var/log/containers/*.log + kubernetes filter (standard).
  • Kubernetes: microk8s 1.35.0 (containerd://2.1.3), 8 nodes (Ubuntu 24.04 LTS).
  • OpenSearch: opensearchproject/opensearch:2.19.4, single-node, plain HTTP on port 9200 (security plugin disabled), healthy throughout the regression. No relevant log entries on the OpenSearch side during the window. The cluster continues to accept writes from non-Fluent-Bit clients without issue.
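
The docs/min figures can be approximated with the OpenSearch `_count` API. The sketch below is one way to do that under stated assumptions: the helper names are hypothetical, the index pattern assumes Logstash_Format produces indices named `fluent-bit-kube-<date>`, and the range filter uses the `@timestamp` field from Time_Key, so it measures event time rather than true arrival time.

```python
import json
import urllib.request


def count_url(base_url, prefix):
    """URL of the _count endpoint across all indices sharing the Logstash prefix."""
    return f"{base_url}/{prefix}-*/_count"


def count_query(minutes, time_key="@timestamp"):
    """Range query matching documents stamped within the last `minutes` minutes."""
    return {"query": {"range": {time_key: {"gte": f"now-{minutes}m", "lt": "now"}}}}


def docs_per_minute(base_url, prefix, minutes=10):
    """Fetch the document count for the window and average it per minute."""
    request = urllib.request.Request(
        count_url(base_url, prefix),
        data=json.dumps(count_query(minutes)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["count"] / minutes
```

With the configuration above, a call would look like `docs_per_minute("http://opensearch.logging.svc.cluster.local:9200", "fluent-bit-kube")`, run from a pod inside the cluster, once per image version at a comparable time of day.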

Additional context

This is a real, measurable regression on a production homelab cluster, not a transient issue. Reproduced cleanly across all 8 Fluent Bit DaemonSet pods on different nodes simultaneously. Rollback to 4.2.3 is the only mitigation we've found.

Suspected scope based on the v4.2.3...v4.2.4 diff (without bisecting):

  • PR #11519 (engine: use effective post-processor counts for output metrics (v4.2)): modifies handle_output_event() in flb_engine.c directly, the function that processes every flush result, and adds new task/route fields. The PR description states it is metrics-only, but the changes sit in the hot flush-handling path; a defect there could explain the observed retry/drop pattern.
  • PR #11693 (lib: monkey: upgrade to v1.8.8): large library bump (286 additions) that may affect HTTP client behavior.
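
Since the first suspect touches output metrics, it may help to compare Fluent Bit's own counters with what OpenSearch actually received. The sketch below diffs two snapshots of the built-in HTTP server's `/api/v1/metrics` JSON for one output. It assumes the HTTP server is enabled (it is not part of the configuration shown above) and that the payload has an `output` object keyed by instance name; the counter names listed are assumptions that should be checked against the actual response, and missing counters are treated as 0.

```python
# Counter names are assumptions; verify them against the real /api/v1/metrics payload.
COUNTERS = ("proc_records", "errors", "retries", "retries_failed", "dropped_records")


def output_delta(before, after, output="opensearch.0"):
    """Per-counter difference between two metrics snapshots for one output instance."""
    old = before.get("output", {}).get(output, {})
    new = after.get("output", {}).get(output, {})
    return {name: new.get(name, 0) - old.get(name, 0) for name in COUNTERS}
```

Taking two snapshots a fixed interval apart on both 4.2.3 and 4.2.4 would show whether the retry and failure counters grow in step with the engine warnings, and whether the processed-record counter still tracks the document count seen in OpenSearch.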

(Note: 4.2.3.1 is just a Dockerfile-only release with no runtime code change, so it's not a useful intermediate. As of writing, there is no 4.2.5 and the 4.2.x branch appears effectively dormant — recent activity is on the 5.x line, which we cannot adopt due to the separate yyjson parser incompatibility with non-JSON inputs.)

Happy to provide more telemetry, run a sandbox bisect if there's interest, or test a proposed fix.
