fix(filter): bound per-Sub retry storm under sustained subscribe failures #1302
Sustained subscribe failures saturated CPU, leaked 600+ subscriptionLoop
goroutines, and twice panicked with `strings: Join output length overflow`.
Five independent issues:
- api/filter: errcnt budget was gated on `possibleRecursiveError`, which
matched only `ErrNoPeersAvailable` / `swarm.ErrDialBackoff`. The dominant
error class never incremented errcnt, so the 3-error-per-5s budget was
dead code. Replaced gate with `shouldIncrementErrCnt(err)`: counts every
non-nil error.
- protocol/filter: WakuFilterLightNode.Subscribe flattened per-peer errors
via `fmt.Errorf+strings.Join`, losing typed *FilterError and growing
unboundedly. Replaced with typed `*SubscribeError` (PeerID, ContentTopics,
Err) plus `HasRateLimitError()`; `Error()` is hard-capped. Concurrent
per-peer appends now mutex-guarded.
- api/filter: 60-s rate-limit backoff on `*SubscribeError.HasRateLimitError()`.
`shouldHonourRateLimitBackoff(rateLimitedUntil, now)` gates ticker push and
closing-channel checkAndResubscribe. Cleared on subscribe success.
- api/filter: FilterManager.waitingToSubQueue was a cap-100 chan written and
drained under the same lock, deadlocking the manager once full. Replaced
with mutex-guarded slice.
- api/filter: Sub.cleanup closed DataCh while multiplex forwarders could
still be sending. Added multiplexWG awaited in cleanup; forwarder send is
in a select with apiSub.ctx.Done() so it can't deadlock when
subDetails.C is never closed (node-stop transitions).
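The typed, size-capped error aggregation described in the protocol/filter item could look roughly like the sketch below. It is not the PR's code: the aggregate layout, the `Add` method, the `errRateLimited` sentinel, and the `maxErrorLen` value are all assumptions made for illustration; only the names `SubscribeError`, `HasRateLimitError()`, and the PeerID / ContentTopics / Err fields come from the description.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync"
)

// errRateLimited is a hypothetical sentinel standing in for the real rate-limit error.
var errRateLimited = errors.New("rate limited")

// maxErrorLen is an assumed cap in bytes; the PR does not state the real value.
const maxErrorLen = 256

// peerError keeps one peer's failure typed instead of flattening it to a string.
type peerError struct {
	PeerID        string
	ContentTopics []string
	Err           error
}

// SubscribeError collects per-peer failures from concurrent subscribe attempts.
type SubscribeError struct {
	mu      sync.Mutex
	perPeer []peerError
}

// Add is safe to call from several goroutines at once.
func (e *SubscribeError) Add(peerID string, topics []string, err error) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.perPeer = append(e.perPeer, peerError{PeerID: peerID, ContentTopics: topics, Err: err})
}

// HasRateLimitError reports whether any peer failed with the rate-limit sentinel.
func (e *SubscribeError) HasRateLimitError() bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	for _, p := range e.perPeer {
		if errors.Is(p.Err, errRateLimited) {
			return true
		}
	}
	return false
}

// Error returns a summary that never exceeds maxErrorLen bytes.
// The byte-level cut may split a multi-byte character; acceptable for a sketch.
func (e *SubscribeError) Error() string {
	e.mu.Lock()
	defer e.mu.Unlock()
	var b strings.Builder
	fmt.Fprintf(&b, "subscribe failed on %d peer(s)", len(e.perPeer))
	for _, p := range e.perPeer {
		if b.Len() >= maxErrorLen {
			break
		}
		fmt.Fprintf(&b, "; %s: %v", p.PeerID, p.Err)
	}
	s := b.String()
	if len(s) > maxErrorLen {
		s = s[:maxErrorLen]
	}
	return s
}

func main() {
	agg := &SubscribeError{}
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			agg.Add(fmt.Sprintf("peer-%d", i), []string{"/demo/1/topic/proto"}, errRateLimited)
		}(i)
	}
	wg.Wait()
	fmt.Println(len(agg.Error()) <= maxErrorLen, agg.HasRateLimitError())
}
```

Because callers can use `errors.Is` on the stored errors, the rate-limit check no longer depends on string matching, and the message size stays constant however many peers fail.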
Tests (all under -race):
- TestSub_CleanupRaceWithMultiplex (50 iter)
- TestSub_CleanupDoesNotDeadlockWhenSubChannelStaysOpen
- TestFilterManager_SubscribeFilter_DoesNotDeadlockWhenQueueFull
- TestShouldIncrementErrCnt
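The shutdown ordering that the two cleanup tests exercise can be sketched as follows. This is a simplified illustration, not the PR's implementation: the `sub` type, its fields, and `forward` are hypothetical stand-ins for `Sub`, `multiplexWG`, `DataCh`, and the multiplex forwarders, and the sketch assumes the forwarder watches the context on both the receive and the send.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// sub is a hypothetical stand-in for the PR's Sub type.
type sub struct {
	ctx    context.Context
	cancel context.CancelFunc
	dataCh chan int
	wg     sync.WaitGroup // plays the role of multiplexWG
}

func newSub() *sub {
	ctx, cancel := context.WithCancel(context.Background())
	return &sub{ctx: ctx, cancel: cancel, dataCh: make(chan int)}
}

// forward copies values from src to dataCh until src closes or ctx is cancelled.
func (s *sub) forward(src <-chan int) {
	s.wg.Add(1) // registered before the goroutine starts, so cleanup always sees it
	go func() {
		defer s.wg.Done()
		for {
			select {
			case v, ok := <-src:
				if !ok {
					return
				}
				select {
				case s.dataCh <- v: // delivered to the consumer
				case <-s.ctx.Done(): // nobody is reading any more; give up
					return
				}
			case <-s.ctx.Done():
				return
			}
		}
	}()
}

// cleanup cancels the context, waits for every forwarder, and only then closes dataCh,
// so no forwarder can send on a closed channel.
func (s *sub) cleanup() {
	s.cancel()
	s.wg.Wait()
	close(s.dataCh)
}

func main() {
	s := newSub()
	src := make(chan int) // never closed, as in the node-stop case
	s.forward(src)
	s.cleanup() // returns even though src stays open
	_, ok := <-s.dataCh
	fmt.Println("dataCh open:", ok)
}
```

Closing the channel only after `wg.Wait()` removes the send-on-closed-channel panic, and the `ctx.Done()` cases keep `cleanup` from blocking when the source channel is never closed.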
I cannot approve for some reason.
darshankabariya
left a comment
Strong fix.
Not blocking, just a suggestion: `shouldIncrementErrCnt` is always true, so inline it.
@Ivansete-status @darshankabariya Pushed a new commit to reduce the verbosity. Please help me run the checks and merge if it's all good.
@Ivansete-status @darshankabariya It's all green and ready to merge! Please hit the merge button. I don't have the rights.
Description
Sustained subscribe failures saturated CPU, leaked 600+ subscriptionLoop goroutines, and twice panicked with `strings: Join output length overflow`. A few independent issues:
Changes
- api/filter: errcnt budget was gated on `possibleRecursiveError`, which matched only `ErrNoPeersAvailable` / `swarm.ErrDialBackoff`. The dominant error class never incremented errcnt, so the 3-error-per-5s budget was dead code. Replaced gate with `shouldIncrementErrCnt(err)`: counts every non-nil error.
- protocol/filter: WakuFilterLightNode.Subscribe flattened per-peer errors via `fmt.Errorf+strings.Join`, losing typed *FilterError and growing unboundedly. Replaced with typed `*SubscribeError` (PeerID, ContentTopics, Err) plus `HasRateLimitError()`; `Error()` is hard-capped. Concurrent per-peer appends now mutex-guarded.
- api/filter: 60-s rate-limit backoff on `*SubscribeError.HasRateLimitError()`. `shouldHonourRateLimitBackoff(rateLimitedUntil, now)` gates ticker push and closing-channel checkAndResubscribe. Cleared on subscribe success.
- api/filter: FilterManager.waitingToSubQueue was a cap-100 chan written and drained under the same lock, deadlocking the manager once full. Replaced with mutex-guarded slice.
- api/filter: Sub.cleanup closed DataCh while multiplex forwarders could still be sending. Added multiplexWG awaited in cleanup; forwarder send is in a select with apiSub.ctx.Done() so it can't deadlock when subDetails.C is never closed (node-stop transitions).
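A minimal sketch of the rate-limit backoff gate, under stated assumptions: the signature `shouldHonourRateLimitBackoff(rateLimitedUntil, now)` and the 60-second window come from the description, while the return-value semantics (true means "skip this resubscribe"), the `backoffState` type, its methods, and `demoNow` are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// rateLimitBackoff is the 60-second window named in the description.
const rateLimitBackoff = 60 * time.Second

// demoNow is a fixed instant so the example is deterministic (illustration only).
var demoNow = time.Unix(1_700_000_000, 0)

// shouldHonourRateLimitBackoff reports whether a resubscribe attempted at `now`
// should be skipped because the backoff deadline has not passed yet.
// Assumed semantics: a zero deadline means no backoff is armed.
func shouldHonourRateLimitBackoff(rateLimitedUntil, now time.Time) bool {
	return !rateLimitedUntil.IsZero() && now.Before(rateLimitedUntil)
}

// backoffState is a hypothetical holder for the per-subscription deadline.
type backoffState struct {
	rateLimitedUntil time.Time
}

// onFailure arms the backoff only when the failure was a rate-limit error.
func (b *backoffState) onFailure(rateLimited bool, now time.Time) {
	if rateLimited {
		b.rateLimitedUntil = now.Add(rateLimitBackoff)
	}
}

// onSuccess clears the backoff, mirroring "Cleared on subscribe success".
func (b *backoffState) onSuccess() {
	b.rateLimitedUntil = time.Time{}
}

func main() {
	b := &backoffState{}
	b.onFailure(true, demoNow)
	fmt.Println(shouldHonourRateLimitBackoff(b.rateLimitedUntil, demoNow.Add(30*time.Second))) // true
	fmt.Println(shouldHonourRateLimitBackoff(b.rateLimitedUntil, demoNow.Add(61*time.Second))) // false
	b.onSuccess()
	fmt.Println(shouldHonourRateLimitBackoff(b.rateLimitedUntil, demoNow)) // false
}
```

Passing `now` as a parameter rather than calling `time.Now()` inside the function keeps the gate a pure function, which makes it straightforward to cover with table-driven tests.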
Tests
Tests (all under -race):