
S3 output buffer consumed by stale files after multiple failed retries #11759

@cypher7682


Bug Report

Describe the bug
On S3 upload failures, files are left stale on disk, so their output buffer space is never reclaimed: the file is not removed and current_buffer_size is not decremented. Repeated errors leave stale output files in the buffer that the output handler is completely unaware of, so failures accumulate until the output buffer is full and the output can no longer process any data.

That is, occasional S3 upload failures (of which there are a lot) gradually stack up to consume every byte of StoreDirLimitSize, until everything grinds to a complete halt and the pod gets killed.

This appears to be related to these lines:

In both cases:

  • s3_store_file_inactive is called
  • which in turn calls flb_fstore_file_inactive
  • That function closes the I/O but intentionally does not delete the file
  • Nothing behind s3_store_file_inactive frees buffer space, even though all references to the file have been dropped and the output has given up on it.
    • By contrast, s3_store_file_delete does release buffer space (ctx->current_buffer_size -= s3_file->size;) and deletes the file.
    • s3_store_file_inactive does not release buffer space because it does not delete the file.
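
To make the difference between the two paths concrete, here is a minimal toy model in C. It is not the Fluent Bit source: every toy_* name is made up for illustration, and the admission check in toy_can_buffer is only an assumption about how a size cap like StoreDirLimitSize might be enforced. It shows that the "inactive" path leaves the byte counter untouched while the "delete" path gives the bytes back.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the plugin state; all names here are illustrative. */
struct toy_ctx {
    size_t current_buffer_size;   /* bytes the output believes are buffered */
    size_t store_dir_limit_size;  /* configured cap (assumed to gate new data) */
};

struct toy_file {
    size_t size;     /* bytes on disk */
    int    on_disk;  /* 1 while the file still exists */
    int    tracked;  /* 1 while the output still holds a reference */
};

/* Models the "inactive" path: the reference is dropped, but the file
 * stays on disk and the counter is not decremented. */
static void toy_file_inactive(struct toy_ctx *ctx, struct toy_file *f)
{
    (void) ctx;
    f->tracked = 0;
}

/* Models the "delete" path: the counter is decremented and the file goes. */
static void toy_file_delete(struct toy_ctx *ctx, struct toy_file *f)
{
    assert(ctx->current_buffer_size >= f->size);
    ctx->current_buffer_size -= f->size;
    f->on_disk = 0;
    f->tracked = 0;
}

/* Assumed admission check: refuse data that would exceed the cap. */
static int toy_can_buffer(const struct toy_ctx *ctx, size_t bytes)
{
    return ctx->current_buffer_size + bytes <= ctx->store_dir_limit_size;
}

/* Counts how many abandoned uploads of `chunk` bytes fit before the
 * admission check starts refusing new data. */
static unsigned toy_failures_until_full(size_t limit, size_t chunk)
{
    struct toy_ctx ctx = { 0, limit };
    unsigned n = 0;

    assert(chunk > 0);
    while (toy_can_buffer(&ctx, chunk)) {
        struct toy_file f = { chunk, 1, 1 };
        ctx.current_buffer_size += chunk;
        toy_file_inactive(&ctx, &f);  /* upload abandoned, bytes never returned */
        n++;
    }
    return n;
}
```

In this model, calling toy_failures_until_full(500, 20) (a cap of 500 units and files of 20 units each) returns 25: once 25 uploads have been abandoned, nothing new can be buffered, and no amount of waiting frees the space.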

In conjunction with this, v4.2.3 also has an off-by-one error (fixed in later versions), so retry_limit = 1 actually means no retries, and this option defaults to 1 rather than 5 as in previous versions. Thus, on v4.2.3, unless you manually specify a retryLimit > 1, transient failures cause a catastrophic loss of records, and the buffer stays permanently occupied by the files for those lost records.
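
As an illustration of how an off-by-one of this kind changes behavior, the sketch below compares two hypothetical give-up checks against an upload that always fails. This is an assumption about the general shape of such a bug, not the actual Fluent Bit code; the function names and the `>=` versus `>` comparison are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

typedef bool (*give_up_fn)(int failures, int retry_limit);

/* Hypothetical off-by-one check: the first failure already meets the
 * limit when retry_limit is 1, so no retry ever happens. */
static bool give_up_off_by_one(int failures, int retry_limit)
{
    return failures >= retry_limit;
}

/* Hypothetical check for the "N retries after the first attempt" reading. */
static bool give_up_expected(int failures, int retry_limit)
{
    return failures > retry_limit;
}

/* Number of upload attempts made against a destination that always
 * fails, before the file is abandoned. */
static int attempts_before_drop(int retry_limit, give_up_fn give_up)
{
    int failures = 0;

    assert(retry_limit >= 0);
    do {
        failures++;  /* one more failed attempt */
    } while (!give_up(failures, retry_limit));
    return failures;
}
```

With retry_limit = 1, the off-by-one variant abandons the file after a single attempt, which matches the "failed to send 1 times, will not retry" warning in the logs further down, while the other variant allows one retry first.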

To Reproduce

  • Rubular link if applicable: u/k
  • Example log message if applicable:

There are lots of correlated bits of information here, so bear with me:

Files on disk:

-rw-------    1 root     root       1183744 Apr 29 11:41 12681897966366703112-14606954368872656564
-rw-------    1 root     root       3411968 Apr 29 11:41 15393042439759612772-10689248011551476248
-rw-------    1 root     root        102400 Apr 29 11:41 16404582273068644347-5967513073307950096
-rw-------    1 root     root      10391552 Apr 29 11:41 17968601020670848119-7239896667824836330
-rw-------    1 root     root          4096 Apr 29 11:41 17266151877237876228-6718229405376101401
-rw-------    1 root     root         36864 Apr 29 11:41 16956281325476515959-5472663707595791072
-rw-------    1 root     root      12128256 Apr 29 11:39 17968601020670848119-17888715470891049674
-rw-------    1 root     root      13570048 Apr 29 11:36 17968601020670848119-9719238430729505548
-rw-------    1 root     root      13242368 Apr 29 11:17 17968601020670848119-2361698986790864637
-rw-------    1 root     root      12292096 Apr 29 11:04 15393042439759612772-18069702356910709128
-rw-------    1 root     root      20221952 Apr 29 11:01 17968601020670848119-13084062489286803840
-rw-------    1 root     root      11636736 Apr 29 10:57 17968601020670848119-11644061896093015244
-rw-------    1 root     root      12652544 Apr 29 10:54 17968601020670848119-294124939407482723
-rw-------    1 root     root          4096 Apr 29 10:54 18346669338110208780-8330162164832928582
-rw-------    1 root     root      12324864 Apr 29 10:51 17968601020670848119-6978170714041475120
-rw-------    1 root     root      14684160 Apr 29 10:24 17968601020670848119-13808492304039359588
-rw-------    1 root     root      12062720 Apr 29 10:17 17968601020670848119-12665229349350099937
-rw-------    1 root     root      13275136 Apr 29 10:16 17968601020670848119-6725517038024379648
-rw-------    1 root     root      12324864 Apr 29 10:15 17968601020670848119-8189199610958239434
-rw-------    1 root     root      14028800 Apr 29 10:12 17968601020670848119-11088013945658484385
-rw-------    1 root     root      13799424 Apr 29 10:10 17968601020670848119-13726761029458994180
-rw-------    1 root     root        495616 Apr 29 10:10 16404582273068644347-11939250306710512735
-rw-------    1 root     root      13209600 Apr 29 10:09 17968601020670848119-17079909337996210440
-rw-------    1 root     root      10293248 Apr 29 09:58 17968601020670848119-5303520551613184777
-rw-------    1 root     root      10948608 Apr 29 09:50 17968601020670848119-6142455639810240160
-rw-------    1 root     root      12947456 Apr 29 09:35 15393042439759612772-2661160915295300259
-rw-------    1 root     root      19304448 Apr 29 09:04 15393042439759612772-4886553951790287162
-rw-------    1 root     root      20025344 Apr 29 09:03 17968601020670848119-11363852993441694475
-rw-------    1 root     root      20254720 Apr 29 09:02 17968601020670848119-11574945865466687512
-rw-------    1 root     root       8327168 Apr 29 08:58 15393042439759612772-17961139713936533472
-rw-------    1 root     root      11014144 Apr 29 08:44 17968601020670848119-10742523279820995919
-rw-------    1 root     root      12357632 Apr 29 08:38 17968601020670848119-13542879461489352312
-rw-------    1 root     root      15273984 Apr 29 08:22 17968601020670848119-8991173050528632184
-rw-------    1 root     root      10915840 Apr 29 08:18 17968601020670848119-3867850942261909169
-rw-------    1 root     root      11014144 Apr 29 08:17 17968601020670848119-2611234149650838141
-rw-------    1 root     root      10883072 Apr 29 08:14 17968601020670848119-3227245893658905412
-rw-------    1 root     root      11866112 Apr 29 08:10 17968601020670848119-16407279727121294508
-rw-------    1 root     root      19468288 Apr 29 08:00 17968601020670848119-1756037490726720884

Error logs for latest error (redacted):

[2026/04/29 11:39:11.873176864] [debug] [upstream] KA connection #45 to s3.eu-central-1.amazonaws.com:443 has been assigned (recycled)
[2026/04/29 11:39:11.873191674] [debug] [http_client] not using http_proxy for header 
[2026/04/29 11:39:11.873207925] [debug] [aws_credentials] Requesting credentials from the EKS provider..
[2026/04/29 11:39:11.873821001] [error] [http_client] broken connection to s3.eu-central-1.amazonaws.com:443 ?
[2026/04/29 11:39:11.873834421] [debug] [aws_client] s3.eu-central-1.amazonaws.com: http_do=-1, HTTP Status: 0 
[2026/04/29 11:39:11.873849522] [debug] [upstream] KA connection #45 to s3.eu-central-1.amazonaws.com:443 is now available
[2026/04/29 11:39:11.873857062] [debug] [aws_client] auto-retrying
[2026/04/29 11:39:11.873864952] [debug] [upstream] KA connection #45 to s3.eu-central-1.amazonaws.com:443 has been assigned (recycled)
[2026/04/29 11:39:11.873871012] [debug] [http_client] not using http_proxy for header
[2026/04/29 11:39:11.873884672] [debug] [aws_credentials] Requesting credentials from the EKS provider..
[2026/04/29 11:39:11.968864211] [error] [http_client] broken connection to s3.eu-central-1.amazonaws.com:443 ? 
[2026/04/29 11:39:11.968878961] [debug] [aws_client] s3.eu-central-1.amazonaws.com: http_do=-1, HTTP Status: 0
[2026/04/29 11:39:11.968893262] [debug] [upstream] KA connection #45 to s3.eu-central-1.amazonaws.com:443 is now available
[2026/04/29 11:39:11.968899502] [error] [output:s3:<name>] PutObject request failed 
[2026/04/29 11:39:11.969948899] [ warn] [output:s3:<name>] Chunk file failed to send 1 times, will not retry

We have debug logging on at the moment, so we don't have logs going back far enough to cover all the stale files, but the stale files on disk correlate quite nicely with the errors reported in the stats:

[Image: errors reported in the stats]

  • Steps to reproduce the problem:

I actually don't know how to reproduce it without temporarily breaking outbound S3 traffic or something similar. This is just something we are actively experiencing.

Expected behavior

Files should be deleted rather than left stale on disk. I suspect it would be more appropriate to use s3_store_file_delete in this case, unless there's a specific reason why Fluent Bit does not do this.
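
A rough sketch of the behavior being asked for, again as a stand-alone model rather than a patch: when the output gives up on a file, the accounted bytes are released and the file is removed. The sketch_* types and the function name are hypothetical; in the real plugin this would presumably go through s3_store_file_delete.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the plugin's real structures. */
struct sketch_ctx  { size_t current_buffer_size; };
struct sketch_file { size_t size; int on_disk; };

/* Hypothetical give-up handler: release the accounted bytes and remove
 * the file, so an abandoned upload cannot pin buffer space forever. */
static void sketch_give_up(struct sketch_ctx *ctx, struct sketch_file *f)
{
    if (!f->on_disk) {
        return;  /* already released; calling twice must not double-decrement */
    }
    assert(ctx->current_buffer_size >= f->size);
    ctx->current_buffer_size -= f->size;
    f->on_disk = 0;  /* stands in for unlinking the file */
}
```

The early return keeps the counter from being decremented twice if both failure paths end up calling the handler for the same file.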

Screenshots
n/a. All screenshots etc are included with bug logs/info.

Your Environment

  • Version used: Fluent-bit v4.2.3 // Fluent-operator v3.5.0
  • Configuration:
# cluster config
[...]
    flushSeconds: 1
    hcErrorsCount: 5
    hcPeriod: 5
    hcRetryFailureCount: 5
    healthCheck: true
    httpServer: true
    logLevel: debug
    parsersFile: parsers.conf
    storage:
      backlogMemLimit: 50MB
      checksum: 'off'
      deleteIrrecoverableChunks: 'on'
      maxChunksUp: 128
      metrics: 'on'
      path: /host/fluent-bit-buffer/
      sync: normal
# s3 output config (redacted)
[...]
  s3:
    Bucket: [BUCKET]
    Compression: gzip
    Region: eu-central-1
    S3KeyFormat: [IRRELEVANT] 
    StaticFilePath: true
    StoreDirLimitSize: 500m
    TotalFileSize: 20M
    UploadTimeout: 60s
    UsePutObject: true
  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes (EKS) v1.33.5
  • Server type and version:
  • Operating System and version:
  • Filters and plugins: Tail, but specifically S3 output relevant to this case.

Additional context
We have enabled RetryLimit: 5 on the output config, which appears to be smoothing over the transient failures quite nicely, and there are no longer stale files on disk. However, if a user is intentionally not retrying uploads or has not specified a retry limit on v4.2.3, they will be affected by this bug.
