Bug Report
Describe the bug
After upgrading from cr.fluentbit.io/fluent/fluent-bit:4.2.3 to 4.2.4, the opensearch output plugin produces sustained failed to flush chunk warnings and cannot be retried errors against a healthy OpenSearch 2.19.4 backend. Ingestion rate drops from ~480 docs/min (baseline at 4.2.3) to ~200 docs/min, with Retry_Limit 3 exhausted and chunks discarded. OpenSearch itself logs no errors and accepts writes from other clients without issue. Rolling back the DaemonSet image to 4.2.3 (no other config change) immediately restores normal behavior: 0 flush errors, ingestion rate recovers to ~561 docs/min.
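The docs/min figures above can be reproduced with a `_count` query over a recent time window. The following is a minimal sketch, not the exact method used for the numbers in this report: it assumes the `fluent-bit-kube-*` index pattern implied by `Logstash_Prefix`, and it filters on `@timestamp`, which is the record time rather than the true arrival time. The function names are hypothetical.

```python
import json
import urllib.request


def count_query(minutes: int, time_field: str = "@timestamp") -> dict:
    """Build a _count request body matching docs from the last `minutes` minutes."""
    return {
        "query": {
            "range": {time_field: {"gte": f"now-{minutes}m", "lt": "now"}}
        }
    }


def docs_per_minute(doc_count: int, minutes: int) -> float:
    """Average ingestion rate over the window."""
    return doc_count / minutes


def fetch_count(base_url: str, index_pattern: str, minutes: int) -> int:
    """POST the query to <base_url>/<index_pattern>/_count and return the count."""
    body = json.dumps(count_query(minutes)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/{index_pattern}/_count",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["count"]
```

For example, `docs_per_minute(fetch_count("http://opensearch.logging.svc.cluster.local:9200", "fluent-bit-kube-*", 10), 10)` gives the average rate over the last ten minutes. Running it at the same time of day on each version keeps the comparison fair.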
To Reproduce
Steps:
1. Run Fluent Bit 4.2.3 as a DaemonSet on a Kubernetes cluster, tailing /var/log/containers/*.log and shipping to an in-cluster OpenSearch via the opensearch output. Confirm steady ingestion.
2. Change only the image tag in the DaemonSet to 4.2.4 and apply.
3. After all DaemonSet pods roll to the new image, observe Fluent Bit logs filling with the following messages at a rate of 16+ warnings per 30 seconds (sustained, not a transient burst):
   [engine] failed to flush chunk '1-<timestamp>.<seq>.flb', retry in N seconds: task_id=X, input=tail.0 > output=opensearch.0 (out_id=0)
   [engine] chunk '1-<timestamp>.<seq>.flb' cannot be retried: task_id=X, input=tail.0 > output=opensearch.0
   Querying the OpenSearch daily index shows the doc-arrival rate roughly halved versus the same time of day at 4.2.3.
4. Roll the DaemonSet image back to 4.2.3 (no other change). Errors stop within one rollout, and the ingestion rate recovers immediately.
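To compare the two versions on something more precise than eyeballing the logs, the warnings from step 3 can be tallied per chunk. This is a rough sketch that matches only the two message shapes quoted above; the function name and the returned keys are hypothetical, and any timestamp or level prefix on the log lines is simply ignored.

```python
import re
from collections import Counter

# Patterns taken from the two engine messages quoted in the steps above.
RETRY_RE = re.compile(r"failed to flush chunk '([^']+)', retry in (\d+) seconds")
DROP_RE = re.compile(r"chunk '([^']+)' cannot be retried")


def summarize(lines):
    """Count retry warnings and discarded chunks in an iterable of log lines."""
    retries = Counter()  # chunk name -> number of "failed to flush" warnings
    dropped = set()      # chunk names that hit "cannot be retried"
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            retries[match.group(1)] += 1
            continue
        match = DROP_RE.search(line)
        if match:
            dropped.add(match.group(1))
    return {
        "retry_warnings": sum(retries.values()),
        "chunks_retried": len(retries),
        "chunks_dropped": len(dropped),
    }
```

Saving the output of `kubectl logs` for one pod to a file and calling `summarize(open("fluent-bit.log"))` on a 4.2.3 capture and a 4.2.4 capture of equal duration gives directly comparable counts.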
Expected behavior
opensearch output should ship chunks as efficiently in 4.2.4 as in 4.2.3. No persistent flush failures against a healthy backend.
Your Environment
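Version used: Fluent Bit 4.2.4 (regressed from 4.2.3)
Image: cr.fluentbit.io/fluent/fluent-bit:4.2.4 (official, amd64)
Input and filters: tail reading /var/log/containers/*.log + kubernetes filter (standard)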
[OUTPUT]
    Name                opensearch
    Match               kube.*
    Host                opensearch.logging.svc.cluster.local
    Port                9200
    Index               fluent-bit-kube
    Type                _doc
    Suppress_Type_Name  On
    Logstash_Format     On
    Logstash_Prefix     fluent-bit-kube
    Time_Key            @timestamp
    Retry_Limit         3
    tls                 Off
OpenSearch: opensearchproject/opensearch:2.19.4, single-node, plain HTTP on port 9200 (security plugin disabled), healthy throughout the regression. No relevant log entries on the OpenSearch side during the window. The cluster continues to accept writes from non-Fluent-Bit clients without issue.
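The claim that OpenSearch keeps accepting writes from other clients can be checked with a small `_bulk` probe sent from outside Fluent Bit. The sketch below is an illustration only: the helper names are hypothetical, and it assumes plain HTTP with no authentication, as described above. It uses the `index` bulk action for simplicity and makes no claim about which action Fluent Bit itself sends.

```python
import json
import urllib.request


def build_bulk_body(index: str, docs: list) -> bytes:
    """Encode docs as NDJSON for the _bulk API: one action line, then one source line."""
    if not docs:
        raise ValueError("at least one document is required")
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return ("\n".join(lines) + "\n").encode("utf-8")


def send_bulk(base_url: str, body: bytes) -> dict:
    """POST the NDJSON body to <base_url>/_bulk and return the parsed response."""
    req = urllib.request.Request(
        f"{base_url}/_bulk",
        data=body,
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

A response whose top-level `errors` field is `false` indicates every item in the probe was accepted. Pointing the probe at a scratch index rather than the `fluent-bit-kube-*` indices avoids polluting the data used for the rate comparison.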
Additional context
This is a real, measurable regression on a production homelab cluster, not a transient issue. Reproduced cleanly across all 8 Fluent Bit DaemonSet pods on different nodes simultaneously. Rollback to 4.2.3 is the only mitigation we've found.
Suspected scope based on the v4.2.3...v4.2.4 diff (without bisecting):
PR #11519 (engine: use effective post-processor counts for output metrics (v4.2)): modifies handle_output_event() in flb_engine.c directly, the function that processes every flush result, and adds new task/route fields. The PR description states it is metrics-only, but the modifications are to the hot flush-handling path; a defect there could explain the observed retry/drop pattern.
PR #11693 (lib: monkey: upgrade to v1.8.8): large library bump (286 additions) that may affect HTTP client behavior.
(Note: 4.2.3.1 is just a Dockerfile-only release with no runtime code change, so it's not a useful intermediate. As of writing, there is no 4.2.5 and the 4.2.x branch appears effectively dormant — recent activity is on the 5.x line, which we cannot adopt due to the separate yyjson parser incompatibility with non-JSON inputs.)
Happy to provide more telemetry, run a sandbox bisect if there's interest, or test a proposed fix.