Skip to content

fix(filter): bound per-Sub retry storm under sustained subscribe failures#1302

Merged
siddarthkay merged 2 commits into
logos-messaging:masterfrom
alexjba:fix/filters
May 18, 2026
Merged

fix(filter): bound per-Sub retry storm under sustained subscribe failures#1302
siddarthkay merged 2 commits into
logos-messaging:masterfrom
alexjba:fix/filters

Conversation

@alexjba
Copy link
Copy Markdown
Collaborator

@alexjba alexjba commented May 14, 2026

Description

Sustained subscribe failures saturated CPU, leaked 600+ subscriptionLoop
goroutines, and twice panicked with strings: Join output length overflow.
A few independent issues:

  • The 3-error-per-5 s retry budget was gated on possibleRecursiveError, which only matched utils.ErrNoPeersAvailable and swarm.ErrDialBackoff. In 1014 observed failures it incremented zero times — the dominant *errors.errorString aggregate wrapper bypassed it. Every subscribe failure pushed the closing channel and re-entered subscribe, producing ~1100 retries/sec.
  • WakuFilterLightNode.Subscribe aggregated per-peer failures via fmt.Errorf("…%s", strings.Join(failedContentTopics, ",")) — losing typed *FilterError (so the apiSub couldn't see HTTP 429) and growing unboundedly until it twice panicked strings: Join output length overflow. Replaced with typed *SubscribeError. Also closed a pre-existing data race on concurrent appends.
  • Waku peers return 429 "filter request rejected due rate limit exceeded" to ask clients to slow down. Pre-fix the apiSub never saw the signal (Bug 2 hid it) and hammered the peer. Added 60 s backoff window via rateLimitedUntil field + shouldHonourRateLimitBackoff predicate gating both retry-trigger paths.

Changes

  • api/filter: errcnt budget was gated on possibleRecursiveError, which matched only ErrNoPeersAvailable / swarm.ErrDialBackoff. The dominant error class never incremented errcnt, so the 3-error-per-5s budget was dead code. Replaced gate with shouldIncrementErrCnt(err): counts every non-nil error.

  • protocol/filter: WakuFilterLightNode.Subscribe flattened per-peer errors via fmt.Errorf+strings.Join, losing typed *FilterError and growing unboundedly. Replaced with typed *SubscribeError (PeerID, ContentTopics, Err) plus HasRateLimitError(); Error() is hard-capped. Concurrent per-peer appends now mutex-guarded.

  • api/filter: 60-s rate-limit backoff on *SubscribeError.HasRateLimitError(). shouldHonourRateLimitBackoff(rateLimitedUntil, now) gates ticker push and closing-channel checkAndResubscribe. Cleared on subscribe success.

  • api/filter: FilterManager.waitingToSubQueue was a cap-100 chan written and drained under the same lock, deadlocking the manager once full. Replaced with mutex-guarded slice.

  • api/filter: Sub.cleanup closed DataCh while multiplex forwarders could still be sending. Added multiplexWG awaited in cleanup; forwarder send is in a select with apiSub.ctx.Done() so it can't deadlock when subDetails.C is never closed (node-stop transitions).

Tests

Tests (all under -race):

  • TestSub_CleanupRaceWithMultiplex (50 iter)
  • TestSub_CleanupDoesNotDeadlockWhenSubChannelStaysOpen
  • TestFilterManager_SubscribeFilter_DoesNotDeadlockWhenQueueFull
  • TestShouldIncrementErrCnt

…ures

  Sustained subscribe failures saturated CPU, leaked 600+ subscriptionLoop
  goroutines, and twice panicked with `strings: Join output length overflow`.
  Five independent issues:

  - api/filter: errcnt budget was gated on `possibleRecursiveError`, which
    matched only `ErrNoPeersAvailable` / `swarm.ErrDialBackoff`. The dominant
    error class never incremented errcnt, so the 3-error-per-5s budget was
    dead code. Replaced gate with `shouldIncrementErrCnt(err)`: counts every
    non-nil error.

  - protocol/filter: WakuFilterLightNode.Subscribe flattened per-peer errors
    via `fmt.Errorf+strings.Join`, losing typed *FilterError and growing
    unboundedly. Replaced with typed `*SubscribeError` (PeerID, ContentTopics,
    Err) plus `HasRateLimitError()`; `Error()` is hard-capped. Concurrent
    per-peer appends now mutex-guarded.

  - api/filter: 60-s rate-limit backoff on `*SubscribeError.HasRateLimitError()`.
    `shouldHonourRateLimitBackoff(rateLimitedUntil, now)` gates ticker push and
    closing-channel checkAndResubscribe. Cleared on subscribe success.

  - api/filter: FilterManager.waitingToSubQueue was a cap-100 chan written and
    drained under the same lock, deadlocking the manager once full. Replaced
    with mutex-guarded slice.

  - api/filter: Sub.cleanup closed DataCh while multiplex forwarders could
    still be sending. Added multiplexWG awaited in cleanup; forwarder send is
    in a select with apiSub.ctx.Done() so it can't deadlock when
    subDetails.C is never closed (node-stop transitions).

  Tests (all under -race):
  - TestSub_CleanupRaceWithMultiplex (50 iter)
  - TestSub_CleanupDoesNotDeadlockWhenSubChannelStaysOpen
  - TestFilterManager_SubscribeFilter_DoesNotDeadlockWhenQueueFull
  - TestShouldIncrementErrCnt
@Ivansete-status
Copy link
Copy Markdown
Collaborator

I cannot approve for some reason.
This PR LGTM.
In logos-delivery we are just disconnecting the peer and trying another one, in case of failure.
See: https://github.com/logos-messaging/logos-delivery/blob/cb35b59f951a8b6ba3815fd7f0cfb95c99192753/waku/node/delivery_service/subscription_manager.nim
We can adopt this backoff behavior too if you consider appropriate @igor-sirotin

@darshankabariya darshankabariya self-requested a review May 15, 2026 17:46
Copy link
Copy Markdown

@darshankabariya darshankabariya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strong fix,
Not blocking. just suggestion shouldIncrementErrCnt is always true — inline it.

@alexjba
Copy link
Copy Markdown
Collaborator Author

alexjba commented May 15, 2026

@Ivansete-status @darshankabariya Pushed a new commit to reduce the verbosity.

Please help me run the checks and merge if it's all good.

@alexjba
Copy link
Copy Markdown
Collaborator Author

alexjba commented May 17, 2026

@Ivansete-status @darshankabariya It's all green and ready to merge! Please hit the merge button. I don't have the rights.

@siddarthkay siddarthkay merged commit d6d3939 into logos-messaging:master May 18, 2026
10 checks passed
alexjba added a commit to status-im/status-go that referenced this pull request May 18, 2026
alexjba added a commit to status-im/status-go that referenced this pull request May 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants