Bug Report
Describe the bug
When an S3 upload fails, the chunk file is left stale on disk: the file is never removed and current_buffer_size is never decremented, so its output buffer space is not released. The output handler is completely unaware of these stale files, so repeated errors accumulate until the output buffer is full and no more data can be processed.
i.e. Occasional failures among the many S3 uploads gradually stack up to consume every byte of StoreDirLimitSize until everything grinds to a complete halt and the pod gets killed.
This appears to be related to these lines:
In both cases:
s3_store_file_inactive is called
- Which, in turn, calls flb_fstore_file_inactive
- This function closes the I/O but intentionally does not delete the file
- Nothing behind s3_store_file_inactive liberates buffer space, even though all references to that file have been relinquished and the output has 'given up' with the file.
- As opposed to s3_store_file_delete which does release buffer space: ctx->current_buffer_size -= s3_file->size; and deletes the file
- Obviously, s3_store_file_inactive is not relinquishing buffer space because it's not deleting the files.
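To make the accounting problem concrete, here is a minimal standalone sketch of the two paths as I understand them. This is not Fluent Bit code: the structs and the model_* functions are hypothetical stand-ins I made up for illustration, and only the current_buffer_size and size names mirror the real statement quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the plugin context (not the real struct). */
struct s3_ctx_model {
    size_t current_buffer_size;   /* bytes the output believes it is buffering */
    size_t store_dir_limit_size;  /* models StoreDirLimitSize */
};

/* Hypothetical stand-in for a buffered chunk file. */
struct s3_file_model {
    size_t size;
    bool on_disk;   /* file still exists in the store directory */
    bool tracked;   /* output still holds a reference to the file */
};

/* Admit a new chunk only if it still fits under the limit. */
static bool model_buffer_add(struct s3_ctx_model *ctx,
                             struct s3_file_model *f, size_t size)
{
    assert(ctx != NULL && f != NULL);
    if (ctx->current_buffer_size + size > ctx->store_dir_limit_size) {
        return false;
    }
    ctx->current_buffer_size += size;
    f->size = size;
    f->on_disk = true;
    f->tracked = true;
    return true;
}

/* Models the 'inactive' path as described in this report: the reference
 * is dropped, but the file and the byte counter are left untouched. */
static void model_file_inactive(struct s3_ctx_model *ctx,
                                struct s3_file_model *f)
{
    (void) ctx;
    f->tracked = false;
}

/* Models the 'delete' path: the counter is decremented and the file goes. */
static void model_file_delete(struct s3_ctx_model *ctx,
                              struct s3_file_model *f)
{
    ctx->current_buffer_size -= f->size;
    f->on_disk = false;
    f->tracked = false;
}

/* Simulate up to n uploads that all fail and are given up on.
 * Returns how many chunks were admitted before the buffer refused one. */
static int model_failed_uploads(struct s3_ctx_model *ctx, int n,
                                size_t chunk_size, bool delete_on_failure)
{
    int admitted = 0;
    for (int i = 0; i < n; i++) {
        struct s3_file_model f = {0};
        if (!model_buffer_add(ctx, &f, chunk_size)) {
            break;
        }
        admitted++;
        if (delete_on_failure) {
            model_file_delete(ctx, &f);
        } else {
            model_file_inactive(ctx, &f);
        }
    }
    return admitted;
}
```

With a limit of 100 bytes and 20-byte chunks, the 'inactive' path admits only 5 chunks before the counter is pinned at the limit, whereas the 'delete' path admits all 10 and ends with the counter back at 0.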
In conjunction with this, v4.2.3 also has an off-by-one error (fixed in later versions), so retry_limit = 1 actually means 'No retries', and this option defaults to 1 rather than 5 as seen in previous versions. Thus, on v4.2.3, if you are not manually specifying a retryLimit > 1, every transient failure becomes a catastrophic loss of records, and the buffer is permanently tainted with those lost records.
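I have not traced the exact comparison in the v4.2.3 source, so the following is only a sketch of how this kind of off-by-one typically looks. Both predicates and the helper are hypothetical names of my own; the point is just that an inclusive check turns a limit of 1 into a single attempt with no retry.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed shape of the buggy check: give up as soon as the number of
 * failures reaches the limit, so the limit counts attempts. */
static bool gives_up_inclusive(int failures, int retry_limit)
{
    return failures >= retry_limit;
}

/* Assumed shape of the intended check: give up only once the failures
 * exceed the limit, so the limit counts retries after the first attempt. */
static bool gives_up_exclusive(int failures, int retry_limit)
{
    return failures > retry_limit;
}

/* Count the upload attempts made for a chunk whose upload always fails. */
static int attempts_until_give_up(int retry_limit,
                                  bool (*gives_up)(int, int))
{
    int failures = 0;
    assert(retry_limit >= 0);
    for (;;) {
        failures++;  /* one more failed attempt */
        if (gives_up(failures, retry_limit)) {
            return failures;
        }
    }
}
```

Under the inclusive check, a limit of 1 yields exactly 1 attempt, which is consistent with the "failed to send 1 times, will not retry" warning in the logs below; under the exclusive check the same limit yields 2 attempts, i.e. one real retry.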
To Reproduce
- Rubular link if applicable: u/k
- Example log message if applicable:
There are lots of correlated bits of information here, so bear with me:
Files on disk:
-rw------- 1 root root 1183744 Apr 29 11:41 12681897966366703112-14606954368872656564
-rw------- 1 root root 3411968 Apr 29 11:41 15393042439759612772-10689248011551476248
-rw------- 1 root root 102400 Apr 29 11:41 16404582273068644347-5967513073307950096
-rw------- 1 root root 10391552 Apr 29 11:41 17968601020670848119-7239896667824836330
-rw------- 1 root root 4096 Apr 29 11:41 17266151877237876228-6718229405376101401
-rw------- 1 root root 36864 Apr 29 11:41 16956281325476515959-5472663707595791072
-rw------- 1 root root 12128256 Apr 29 11:39 17968601020670848119-17888715470891049674
-rw------- 1 root root 13570048 Apr 29 11:36 17968601020670848119-9719238430729505548
-rw------- 1 root root 13242368 Apr 29 11:17 17968601020670848119-2361698986790864637
-rw------- 1 root root 12292096 Apr 29 11:04 15393042439759612772-18069702356910709128
-rw------- 1 root root 20221952 Apr 29 11:01 17968601020670848119-13084062489286803840
-rw------- 1 root root 11636736 Apr 29 10:57 17968601020670848119-11644061896093015244
-rw------- 1 root root 12652544 Apr 29 10:54 17968601020670848119-294124939407482723
-rw------- 1 root root 4096 Apr 29 10:54 18346669338110208780-8330162164832928582
-rw------- 1 root root 12324864 Apr 29 10:51 17968601020670848119-6978170714041475120
-rw------- 1 root root 14684160 Apr 29 10:24 17968601020670848119-13808492304039359588
-rw------- 1 root root 12062720 Apr 29 10:17 17968601020670848119-12665229349350099937
-rw------- 1 root root 13275136 Apr 29 10:16 17968601020670848119-6725517038024379648
-rw------- 1 root root 12324864 Apr 29 10:15 17968601020670848119-8189199610958239434
-rw------- 1 root root 14028800 Apr 29 10:12 17968601020670848119-11088013945658484385
-rw------- 1 root root 13799424 Apr 29 10:10 17968601020670848119-13726761029458994180
-rw------- 1 root root 495616 Apr 29 10:10 16404582273068644347-11939250306710512735
-rw------- 1 root root 13209600 Apr 29 10:09 17968601020670848119-17079909337996210440
-rw------- 1 root root 10293248 Apr 29 09:58 17968601020670848119-5303520551613184777
-rw------- 1 root root 10948608 Apr 29 09:50 17968601020670848119-6142455639810240160
-rw------- 1 root root 12947456 Apr 29 09:35 15393042439759612772-2661160915295300259
-rw------- 1 root root 19304448 Apr 29 09:04 15393042439759612772-4886553951790287162
-rw------- 1 root root 20025344 Apr 29 09:03 17968601020670848119-11363852993441694475
-rw------- 1 root root 20254720 Apr 29 09:02 17968601020670848119-11574945865466687512
-rw------- 1 root root 8327168 Apr 29 08:58 15393042439759612772-17961139713936533472
-rw------- 1 root root 11014144 Apr 29 08:44 17968601020670848119-10742523279820995919
-rw------- 1 root root 12357632 Apr 29 08:38 17968601020670848119-13542879461489352312
-rw------- 1 root root 15273984 Apr 29 08:22 17968601020670848119-8991173050528632184
-rw------- 1 root root 10915840 Apr 29 08:18 17968601020670848119-3867850942261909169
-rw------- 1 root root 11014144 Apr 29 08:17 17968601020670848119-2611234149650838141
-rw------- 1 root root 10883072 Apr 29 08:14 17968601020670848119-3227245893658905412
-rw------- 1 root root 11866112 Apr 29 08:10 17968601020670848119-16407279727121294508
-rw------- 1 root root 19468288 Apr 29 08:00 17968601020670848119-1756037490726720884
Error logs for latest error (redacted):
[2026/04/29 11:39:11.873176864] [debug] [upstream] KA connection #45 to s3.eu-central-1.amazonaws.com:443 has been assigned (recycled)
[2026/04/29 11:39:11.873191674] [debug] [http_client] not using http_proxy for header
[2026/04/29 11:39:11.873207925] [debug] [aws_credentials] Requesting credentials from the EKS provider..
[2026/04/29 11:39:11.873821001] [error] [http_client] broken connection to s3.eu-central-1.amazonaws.com:443 ?
[2026/04/29 11:39:11.873834421] [debug] [aws_client] s3.eu-central-1.amazonaws.com: http_do=-1, HTTP Status: 0
[2026/04/29 11:39:11.873849522] [debug] [upstream] KA connection #45 to s3.eu-central-1.amazonaws.com:443 is now available
[2026/04/29 11:39:11.873857062] [debug] [aws_client] auto-retrying
[2026/04/29 11:39:11.873864952] [debug] [upstream] KA connection #45 to s3.eu-central-1.amazonaws.com:443 has been assigned (recycled)
[2026/04/29 11:39:11.873871012] [debug] [http_client] not using http_proxy for header
[2026/04/29 11:39:11.873884672] [debug] [aws_credentials] Requesting credentials from the EKS provider..
[2026/04/29 11:39:11.968864211] [error] [http_client] broken connection to s3.eu-central-1.amazonaws.com:443 ?
[2026/04/29 11:39:11.968878961] [debug] [aws_client] s3.eu-central-1.amazonaws.com: http_do=-1, HTTP Status: 0
[2026/04/29 11:39:11.968893262] [debug] [upstream] KA connection #45 to s3.eu-central-1.amazonaws.com:443 is now available
[2026/04/29 11:39:11.968899502] [error] [output:s3:<name>] PutObject request failed
[2026/04/29 11:39:11.969948899] [ warn] [output:s3:<name>] Chunk file failed to send 1 times, will not retry
We have debug logging on at the moment, so we don't have logs dating back for all the stale files - but the stale files on disk correlate quite nicely with errors reported in the stats:
- Steps to reproduce the problem:
I actually don't know how to reproduce it without temporarily breaking S3 outbounds or something? This is just something we are actively experiencing.
Expected behavior
Files should be deleted rather than left stale on disk. I suspect it would be more appropriate to call s3_store_file_delete in this case, unless there's a specific reason why Fluent Bit does not do this.
Screenshots
n/a. All screenshots etc are included with bug logs/info.
Your Environment
- Version used: Fluent-bit v4.2.3 // Fluent-operator v3.5.0
- Configuration:
# cluster config
[...]
flushSeconds: 1
hcErrorsCount: 5
hcPeriod: 5
hcRetryFailureCount: 5
healthCheck: true
httpServer: true
logLevel: debug
parsersFile: parsers.conf
storage:
  backlogMemLimit: 50MB
  checksum: 'off'
  deleteIrrecoverableChunks: 'on'
  maxChunksUp: 128
  metrics: 'on'
  path: /host/fluent-bit-buffer/
  sync: normal
# s3 output config (redacted)
[...]
s3:
  Bucket: [BUCKET]
  Compression: gzip
  Region: eu-central-1
  S3KeyFormat: [IRRELEVANT]
  StaticFilePath: true
  StoreDirLimitSize: 500m
  TotalFileSize: 20M
  UploadTimeout: 60s
  UsePutObject: true
- Environment name and version (e.g. Kubernetes? What version?): Kubernetes (EKS) v1.33.5
- Server type and version:
- Operating System and version:
- Filters and plugins: Tail, but specifically S3 output relevant to this case.
Additional context
We have enabled RetryLimit: 5 on the output config, which appears to be smoothing over the transient failures quite nicely, and there are no longer stale files on disk. However, if a user is intentionally not retrying uploads or has not specified a retry limit on v4.2.3, they will be affected by this bug.