Adds retry support to the Amazon.Lambda.DurableExecution #2363
GarrettBeatty wants to merge 1 commit into
Pull request overview
Builds on PR #2360 to add retry support to the Amazon.Lambda.DurableExecution SDK. Failed steps can now be retried with configurable backoff and jitter via service-mediated retries (the SDK checkpoints a RETRY operation and suspends the Lambda so the user is not billed during backoff). Adds at-most-once semantics for non-idempotent steps via a synchronously-flushed START checkpoint that allows crash detection on replay.
Changes:
- New public retry API: IRetryStrategy, RetryDecision, RetryStrategy factories (Default/Transient/None/Exponential/FromDelegate), JitterStrategy, StepSemantics, and StepConfig.RetryStrategy / StepConfig.Semantics.
- StepOperation adds PENDING (retry-timer) and STARTED (AtMostOnce crash-recovery) replay arms, a HandleStepFailureAsync decision tree, and START-checkpoint emission (sync for AtMostOnce, fire-and-forget for AtLeastOnce).
- 21 new unit tests plus integration-test updates asserting StepStarted events and richer history logging.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| Config/IRetryStrategy.cs | New strategy interface + RetryDecision struct |
| Config/RetryStrategy.cs | ExponentialRetryStrategy, DelegateRetryStrategy, JitterStrategy, StepSemantics, factories |
| Config/StepConfig.cs | Adds RetryStrategy and Semantics properties |
| Internal/StepOperation.cs | PENDING/STARTED replay arms, retry decision tree, START-checkpoint emission |
| Internal/TerminationManager.cs | Adds RetryScheduled termination reason |
| Internal/CheckpointBatcher.cs | Doc-only update describing fire-and-forget semantics |
| Tests/RetryStrategyTests.cs | 14 unit tests for exponential math/jitter/filters/delegate |
| Tests/DurableContextTests.cs | 6 retry/AtMostOnce/Pending replay tests |
| Tests/DurableFunctionTests.cs | Updated to assert START + SUCCEED + WAIT-START flat sequence |
| IntegrationTests/*.cs | Add StepStarted-event assertions; richer history dump in DurableFunctionDeployment |
  var history = await deployment.WaitForHistoryAsync(
      arn!,
-     h => (h.Events?.Count(e => e.StepSucceededDetails != null) ?? 0) >= 2
+     h => (h.Events?.Count(e => e.EventType == EventType.StepStarted) ?? 0) >= 2
now that we are emitting START steps (which are needed for retries) we are asserting them in the IT tests
COPY bin/publish/ ${LAMBDA_TASK_ROOT}

ENTRYPOINT ["/var/task/bootstrap"]
- /// Replay semantics — example: <c>await ctx.StepAsync(ChargeCard, "charge")</c>
+ /// Replay branches — example: <c>await ctx.StepAsync(ChargeCard, "charge")</c>
  /// <list type="bullet">
  /// <item>Fresh: no prior state → run func → emit SUCCEED → return result.</item>
in previous PR only SUCCEEDED or FAILED mattered. But now for replays, we need to keep track of how many times the function was executed, which is done via the number of STARTED steps.
public static class RetryStrategy
{
    /// <summary>6 attempts, 2x backoff, 5s initial delay, 60s max, Full jitter.</summary>
    public static IRetryStrategy Default { get; } = Exponential(
these defaults and values match javascript
/// <see cref="CheckpointBatcher"/>; this type is the inbound side only.
/// </summary>
/// <remarks>
/// Replay tracking mirrors the Python / Java / JavaScript reference SDKs:
just updating docs here and everywhere to remove stuff like "similar to python/js"
Pull request overview
Copilot reviewed 25 out of 25 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
Libraries/src/Amazon.Lambda.DurableExecution/RetryStrategy.cs:117
RegexOptions.Compiled is not trim/AOT-safe and can trigger IL3050 (RequiresDynamicCode) warnings, which this project treats as errors. Consider removing Compiled or gating it on RuntimeFeature.IsDynamicCodeSupported, and prefer a non-compiled regex option set (e.g., CultureInvariant/IgnoreCase as needed).
_backoffRate = backoffRate;
_jitter = jitter;
_retryableExceptions = retryableExceptions;
_retryableMessagePatterns = retryableMessagePatterns?
.Select(p => new Regex(p, RegexOptions.Compiled))
.ToArray();
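The gating this comment suggests could look roughly like the sketch below. RetryPatternCompiler and Build are hypothetical names that are not part of the PR, and the helper returns an empty array for null input where the original code keeps null. Whether IL3050 actually fires for RegexOptions.Compiled depends on the target framework and analyzer configuration, which is not verified here.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Text.RegularExpressions;

// Hypothetical helper (not in the PR): picks regex options based on whether
// the runtime can generate code dynamically.
public static class RetryPatternCompiler
{
    public static Regex[] Build(IEnumerable<string>? patterns)
    {
        if (patterns is null) return Array.Empty<Regex>();

        // Always-safe baseline: interpreted regex with culture-invariant matching.
        var options = RegexOptions.CultureInvariant;

        // Only request compilation to IL when the runtime supports dynamic code.
        if (RuntimeFeature.IsDynamicCodeSupported)
        {
            options |= RegexOptions.Compiled;
        }

        return patterns.Select(p => new Regex(p, options)).ToArray();
    }
}
```

Matching behavior is the same on both paths; only the regex construction strategy differs.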
/// Wire-format <see cref="Operation.Type"/> string constants.
/// Plural name avoids collision with <c>Amazon.Lambda.OperationType</c>.
/// </summary>
public static class OperationTypes
| Our class | SDK equivalent | Wire values match? | Why we have it |
|---|---|---|---|
| OperationTypes | Amazon.Lambda.OperationType | Yes ("STEP", "WAIT", etc.) | We need string constants for switch arms on Operation.Type (string from JSON deserialization). The SDK type is class OperationType : ConstantClass, not string — its constants can't appear in C# switch cases against a string. |
| OperationStatuses | Amazon.Lambda.OperationStatus | Yes — but the SDK has TIMED_OUT and we don't (we never observe it on the wire today). | Same reason as above. |
| OperationSubTypes | doesn't exist in the SDK | n/a | The SDK ships no OperationSubType constant class. The wire values are PascalCase ("Step", "Wait") and were verified against the JS SDK definition. |
| OperationAction | Amazon.Lambda.OperationAction | Yes | We use the SDK's class directly (e.g., OperationAction.START) because we're constructing an Amazon.Lambda.Model.OperationUpdate — its Action property is typed as OperationAction, so the SDK constant is the right thing to assign. |
some of these constants I duplicated since we just need the string value. But since I'm adding more now, wondering if we want to use the ones in the model everywhere? It would require doing switch (op.Type) { case var t when t == OperationType.STEP.Value: ... } or similar everywhere to access the string value of each ConstantClass.
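To make the trade-off concrete, here is a small self-contained sketch of the two switch styles. WireConstant and SdkOperationType are simplified stand-ins for the SDK's ConstantClass and Amazon.Lambda.OperationType, and the member names on OperationTypes are assumed for illustration; none of this is copied from the PR.

```csharp
using System;

// Simplified, hypothetical stand-in for a ConstantClass-style type.
public sealed class WireConstant
{
    public string Value { get; }
    public WireConstant(string value) => Value = value;
}

// Hypothetical stand-in for the SDK constants: static instances, not const strings.
public static class SdkOperationType
{
    public static readonly WireConstant STEP = new WireConstant("STEP");
    public static readonly WireConstant WAIT = new WireConstant("WAIT");
}

// Duplicated string constants (member names assumed for this sketch).
public static class OperationTypes
{
    public const string Step = "STEP";
    public const string Wait = "WAIT";
}

public static class Dispatch
{
    // const strings are compile-time constants, so they work as constant patterns.
    public static string WithStringConstants(string type) => type switch
    {
        OperationTypes.Step => "step",
        OperationTypes.Wait => "wait",
        _ => "unknown",
    };

    // Class-typed constants are not compile-time constants, so every arm
    // needs a var pattern plus a guard comparing against .Value.
    public static string WithSdkConstants(string type)
    {
        switch (type)
        {
            case var t when t == SdkOperationType.STEP.Value: return "step";
            case var t when t == SdkOperationType.WAIT.Value: return "wait";
            default: return "unknown";
        }
    }
}
```

Both methods dispatch identically; the difference is only how much ceremony each arm carries.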
/// are the scenarios that can actually fill a batch — today every batch is
/// 1 item with <see cref="FlushInterval"/> = Zero, so the gap is latent.
/// </remarks>
internal int MaxBatchBytes { get; init; } = 750 * 1024;
this max batch bytes thing is still a todo item in another pr
  {
-     if (!_isReplaying) return;

      // Independent of IsReplaying: as long as a checkpoint record exists
this change catches this scenario:
- Deploy v1: workflow calls StepAsync(name: "fetch") first.
- The step fails, the SDK writes a RETRY checkpoint, the service stores Operation { Id: hash("1"), Type: "STEP", Status: "PENDING" }, and re-invokes after the delay.
- Before re-invoke, deploy v2: workflow now calls WaitAsync(name: "fetch") first instead. (The user shouldn't do this, but it's the exact scenario ValidateReplayConsistency is supposed to catch.)
- Service re-invokes Lambda. The checkpoint envelope contains the PENDING step record at hash("1"). _isReplaying = false (no terminal ops, only the PENDING one).
- User code reaches the first await: WaitAsync(name: "fetch") at hash("1"). ValidateReplayConsistency is called.
With the old if (!_isReplaying) return; short-circuit at the top, this validation would return early — even though _operations[hash("1")] exists with Type = "STEP" and our user is asking for type "WAIT". The mismatch slips through silently. Then ReplayAsync runs against a STEP record while user code expects a WAIT, producing weird downstream behavior.
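A minimal sketch of the check described above, assuming a dictionary of checkpoint records keyed by operation id. CheckpointRecord, ReplayValidator, and Validate are hypothetical names for illustration; the real ValidateReplayConsistency is not shown in this conversation and may differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified view of a checkpointed operation.
public sealed record CheckpointRecord(string Id, string Type, string Status);

public sealed class ReplayValidator
{
    private readonly Dictionary<string, CheckpointRecord> _operations = new();

    public ReplayValidator(IEnumerable<CheckpointRecord> records)
    {
        foreach (var record in records) _operations[record.Id] = record;
    }

    // Deliberately has no early return on an "is replaying" flag: any existing
    // record, including an in-progress one (PENDING/READY/STARTED), is checked.
    public void Validate(string operationId, string expectedType)
    {
        if (!_operations.TryGetValue(operationId, out var recorded))
        {
            return; // No record at this id: a fresh operation, nothing to compare.
        }

        if (!string.Equals(recorded.Type, expectedType, StringComparison.Ordinal))
        {
            throw new InvalidOperationException(
                $"Operation {operationId} was checkpointed as {recorded.Type} " +
                $"but the code now requests {expectedType}.");
        }
    }
}
```

In the v1/v2 scenario, the PENDING STEP record at hash("1") makes the WAIT request throw instead of replaying against the wrong record type.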
Important
1. _retryableMessagePatterns = retryableMessagePatterns?
       .Select(p => new Regex(p, RegexOptions.Compiled))
       .ToArray();
RegexOptions.Compiled emits IL at runtime, which triggers IL3050 (RequiresDynamicCode) in trimmed/AOT deployments. Since this SDK explicitly targets Lambda AOT scenarios (the
if (DateTimeOffset.UtcNow.ToUnixTimeMilliseconds() < scheduledMs)
Using DateTimeOffset.UtcNow during replay means clock skew between invocations could cause a step to re-suspend when the service intended it to run (or vice versa). In
The author's inline comment explains why: in-progress ops (PENDING/READY/STARTED) don't set IsReplaying=true but their type/name still needs validation against code drift. This
Nits
Questions
#2216
What
Adds retry support to Amazon.Lambda.DurableExecution on top of #2360. A step that throws can now be retried with configurable backoff and jitter. The Lambda suspends between attempts and is re-invoked by the service when the retry timer fires, so compute is not billed during the wait.

Public API:
- IRetryStrategy
- RetryDecision: IRetryStrategy.ShouldRetry (ShouldRetry flag plus Delay).
- RetryStrategy: Default, Transient, None, Exponential(...), FromDelegate(...).
- JitterStrategy: None / Half / Full for exponential backoff.
- StepSemantics: AtLeastOncePerRetry (default) / AtMostOncePerRetry.
- StepConfig.RetryStrategy, StepConfig.Semantics
- StepInterruptedException: AtMostOncePerRetry.

How
When a step throws, StepOperation.HandleStepFailureAsync calls IRetryStrategy.ShouldRetry(ex, attemptNumber). If the strategy says retry, the SDK writes a RETRY checkpoint with NextAttemptDelaySeconds and suspends: RunAsync returns Pending. The service holds the execution until the delay elapses, then re-invokes us. On replay, StepOperation.ReplayAsync sees the PENDING status and either re-suspends (timer not yet up) or re-executes the step with an incremented attempt counter.

AtLeastOncePerRetry (default)
For idempotent steps. The SDK writes the START checkpoint as fire-and-forget: user code runs immediately, without waiting on the network round-trip. On success the SDK writes SUCCEED. The cached result is returned on every subsequent replay; user code never re-executes after success.

If the Lambda crashes mid-attempt (before SUCCEED is recorded), replay sees STARTED and re-runs the same attempt under the same attempt counter. It is "at least once per retry" because the user's logic may run more than once for a single attemptNumber if the host dies between START and the terminal record. This is the right default when the step is safe to repeat (a read, an idempotent PUT, a calculation).

AtMostOncePerRetry
For non-idempotent steps (charging a card, sending an email, posting to a non-idempotent API). The SDK writes the START checkpoint synchronously: user code does not run until the service has acknowledged the START. The flush is load-bearing for correctness: a queued-but-unflushed START would be indistinguishable from "never ran" if the Lambda dies, and replay would re-execute the side effect.

If the Lambda crashes between user code and the SUCCEED flush, the service sees a STARTED record with no terminal counterpart on the next invocation. Instead of re-running the step, the SDK synthesizes a StepInterruptedException and routes it through the retry strategy, which decides whether attempt N+1 should run or whether to give up. The user's code is invoked at most once per attemptNumber: either it ran to completion (SUCCEED/FAIL recorded), or the host died and that attempt is closed out as failed.

User retry strategies can pattern-match on StepInterruptedException to decide whether crash-recovery should retry or surface the failure. This is useful when the side effect is non-idempotent enough that you'd rather fail than risk replaying.

Choosing between them
- AtLeastOncePerRetry is the default. Use it unless your step has a side effect you can't safely repeat.
- Use AtMostOncePerRetry when the step calls a non-idempotent external API. The trade-off: one extra synchronous round-trip per attempt for the START flush.

ExponentialRetryStrategy supports max attempts, initial/max delay, backoff rate, jitter, and exception filtering by type or message regex. Built-in factories: Default (6 attempts, 5s/60s, 2× backoff, full jitter), Transient (3 attempts, 1s/5s, half jitter), None. RetryStrategy.FromDelegate(...) covers arbitrary policies, including ones that branch on StepInterruptedException.

Testing
21 new unit tests in Amazon.Lambda.DurableExecution.Tests (130 total, up from 109 in #2360):
- RetryStrategyTests (14): exponential backoff math, jitter, max-attempt exhaustion, exception-type and message-pattern filtering, delegate strategies.
- DurableContextTests retry block (6): checkpoint-and-suspend on retry, fail-without-strategy, retry exhaustion, future/past PENDING replay, AtMostOnce start-flush ordering, STARTED replay routing through the retry handler.

Integration tests in Amazon.Lambda.DurableExecution.IntegrationTests run end-to-end against the durable-execution service:
- RetryTest: flaky step recovers on attempt 3.
- RetryExhaustionTest: step always throws, exhausts retries, surfaces FAILED with the original exception.
- AtMostOnceCrashReplayTest: Lambda is killed mid-attempt under AtMostOncePerRetry; service re-invokes, SDK routes through retry strategy, attempt 2 succeeds.
- LongRetryChainTest: five failures across multiple invocations, validates the wire-format Attempt counter is monotonic and matches IStepContext.AttemptNumber.

Out of scope (follow-up PRs)
- MapAsync / ParallelAsync / RunInChildContextAsync / WaitForConditionAsync
- CallbackAsync, InvokeAsync
- DefaultJsonCheckpointSerializer
- DurableLogger replay-suppression (currently NullLogger)
- [DurableExecution] attribute
- DurableTestRunner / Amazon.Lambda.DurableExecution.Testing package
- dotnet new lambda.DurableFunction blueprint
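For reviewers who want to sanity-check the backoff numbers quoted for the built-in factories, the delay calculation can be sketched as follows. This is an illustration, not the PR's ExponentialRetryStrategy: Jitter, BackoffSketch, and ComputeDelay are hypothetical names, and the jitter formulas are the conventional definitions (Full draws uniformly from zero to the capped delay, Half keeps half the capped delay and randomizes the rest), which are assumed rather than confirmed. Any rounding the SDK applies for NextAttemptDelaySeconds is not modeled.

```csharp
using System;

// Hypothetical mirror of JitterStrategy for this sketch only.
public enum Jitter { None, Half, Full }

public static class BackoffSketch
{
    // attemptNumber is 1-based: the attempt that just failed.
    public static TimeSpan ComputeDelay(
        int attemptNumber,
        TimeSpan initialDelay,
        TimeSpan maxDelay,
        double backoffRate,
        Jitter jitter,
        Random rng)
    {
        // Exponential growth: initial * rate^(attempt - 1), capped at maxDelay.
        double baseSeconds = initialDelay.TotalSeconds * Math.Pow(backoffRate, attemptNumber - 1);
        double capped = Math.Min(baseSeconds, maxDelay.TotalSeconds);

        double jittered = jitter switch
        {
            Jitter.Full => rng.NextDouble() * capped,                  // [0, capped)
            Jitter.Half => capped / 2 + rng.NextDouble() * capped / 2, // [capped/2, capped)
            _ => capped,                                               // no randomization
        };

        return TimeSpan.FromSeconds(jittered);
    }
}
```

With a 5s initial delay, a 60s cap, a 2x rate, and no jitter, attempts 1 through 5 give 5s, 10s, 20s, 40s, and 60s (the uncapped 80s is clamped).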