Add ParallelAsync for concurrent branch execution (DOTNET-8662) #2375

Draft
GarrettBeatty wants to merge 12 commits into gcbeatty/durable-child-context from gcbeatty/durable-parallel
GarrettBeatty commented May 14, 2026

#2216

Summary

Adds parallel branch execution to the .NET Durable Execution SDK. ParallelAsync runs N branches concurrently with configurable concurrency limits and completion policies, returning an IBatchResult<T> with per-branch status and error information.

Per-branch checkpoint payloads are serialized via the ILambdaSerializer registered on ILambdaContext.Serializer (typically configured through LambdaBootstrapBuilder.Create(handler, serializer)), matching the StepAsync / RunInChildContextAsync pattern. There are no separate reflection / AOT-safe overload pairs: the AOT story is determined entirely by which serializer the user registers with the runtime (e.g., SourceGeneratorLambdaJsonSerializer<TContext> for AOT scenarios).

Stacked on top of #2372 (Wave 0 cross-cutting types).

Fixes DOTNET-8662.

The shared IBatchResult<T> family added here will be reused by MapAsync (Wave 2).

Public surface

  • IDurableContext.ParallelAsync<T> (2 overloads: Func[] vs DurableBranch<T>[])
  • DurableBranch<T> record (Name + Func)
  • ParallelConfig (MaxConcurrency, CompletionConfig, NestingType)
  • CompletionConfig with factories AllSuccessful() / FirstSuccessful() / AllCompleted(); ToleratedFailureCount / ToleratedFailurePercentage (validated 0.0-1.0)
  • IBatchResult<T> with All / Succeeded / Failed / Started accessors, GetResults, GetErrors, ThrowIfError, HasFailure, CompletionReason, count properties
  • IBatchItem<T> with Index, Name, Status, Result, Error
  • BatchItemStatus { Succeeded, Failed, Started }
  • CompletionReason { AllCompleted, MinSuccessfulReached, FailureToleranceExceeded }
  • NestingType (Nested default; Flat throws NotSupportedException - reserved for a follow-up)
  • ParallelException (carries IBatchResult; future-subclassable)
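To make the completion policies above concrete, here is a minimal sketch of how tolerated-failure thresholds could map to a CompletionReason. It is not the SDK's implementation: `CompletionPolicySketch`, `Evaluate`, and the `minSuccessful` parameter are hypothetical names, the enum is a local stand-in, and the threshold semantics (failures strictly above the tolerance end the batch) are an assumption.

```csharp
#nullable enable
using System;

// Local stand-in for the SDK enum; member names are taken from the PR description.
public enum CompletionReason { AllCompleted, MinSuccessfulReached, FailureToleranceExceeded }

public static class CompletionPolicySketch
{
    // Returns a reason once the batch can stop, or null while branches should keep running.
    // Assumption: the batch fails only when failures go strictly above the tolerated amount.
    public static CompletionReason? Evaluate(
        int total, int succeeded, int failed,
        int? minSuccessful = null,
        int? toleratedFailureCount = null,
        double? toleratedFailurePercentage = null)
    {
        // Mirrors the "validated 0.0-1.0" note: the percentage is a fraction.
        if (toleratedFailurePercentage is < 0.0 or > 1.0)
            throw new ArgumentOutOfRangeException(
                nameof(toleratedFailurePercentage), "Expected a fraction between 0.0 and 1.0.");

        if (toleratedFailureCount is int maxCount && failed > maxCount)
            return CompletionReason.FailureToleranceExceeded;

        if (toleratedFailurePercentage is double maxFraction
            && total > 0 && (double)failed / total > maxFraction)
            return CompletionReason.FailureToleranceExceeded;

        if (minSuccessful is int min && succeeded >= min)
            return CompletionReason.MinSuccessfulReached;

        if (succeeded + failed >= total)
            return CompletionReason.AllCompleted;

        return null; // still waiting on in-flight branches
    }
}
```

Under these assumptions, a FirstSuccessful-style policy would correspond to `minSuccessful: 1`, and an AllSuccessful-style policy to `toleratedFailureCount: 0`.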

Internal

  • ParallelOperation<T> orchestrator dispatches branches with optional semaphore-bounded concurrency. Each branch runs as a ChildContextOperation<T> with a deterministic ID via OperationIdGenerator.CreateChild.
  • Branch failures aggregated as IBatchItem<T> entries; orchestrator throws ParallelException only when CompletionConfig signals FailureToleranceExceeded.
  • ExecutionState is now thread-safe: a lock guards reads and writes of _operations, _visitedOperations, and _isReplaying. This is required for concurrent branch replay; the change touches all operations but introduces no regressions.
  • ParallelOperation awaits Task.WhenAll(inFlight) before disposing the semaphore, so a cancellation or exception during dispatch still lets in-flight branches settle cleanly.
  • Reuses OperationSubTypes.Parallel / OperationSubTypes.ParallelBranch from Wave 0.
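The dispatch pattern described in this list (semaphore-bounded concurrency, per-branch failure capture, and settling in-flight work before the semaphore is disposed) can be sketched with plain BCL types. This is an illustrative sketch only, not the ParallelOperation<T> code: `BoundedDispatchSketch`, `BranchOutcome<T>`, and `RunAsync` are hypothetical names, and checkpointing and child contexts are left out entirely.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical per-branch outcome; the SDK exposes IBatchItem<T> instead.
public sealed record BranchOutcome<T>(int Index, T? Result, Exception? Error);

public static class BoundedDispatchSketch
{
    // Runs all branches with at most maxConcurrency in flight and returns one outcome per branch.
    public static async Task<BranchOutcome<T>[]> RunAsync<T>(
        IReadOnlyList<Func<CancellationToken, Task<T>>> branches,
        int maxConcurrency,
        CancellationToken cancellationToken = default)
    {
        if (maxConcurrency < 1)
            throw new ArgumentOutOfRangeException(nameof(maxConcurrency));

        var inFlight = new List<Task<BranchOutcome<T>>>(branches.Count);
        var gate = new SemaphoreSlim(maxConcurrency);
        try
        {
            for (var i = 0; i < branches.Count; i++)
            {
                // May throw OperationCanceledException in the middle of dispatch.
                await gate.WaitAsync(cancellationToken);
                inFlight.Add(RunOneAsync(i, branches[i], gate, cancellationToken));
            }
            return await Task.WhenAll(inFlight);
        }
        finally
        {
            // Branches that already started still call gate.Release(), so wait for
            // them before disposing. RunOneAsync never faults, so this cannot throw.
            await Task.WhenAll(inFlight);
            gate.Dispose();
        }
    }

    private static async Task<BranchOutcome<T>> RunOneAsync<T>(
        int index, Func<CancellationToken, Task<T>> branch,
        SemaphoreSlim gate, CancellationToken cancellationToken)
    {
        try
        {
            return new BranchOutcome<T>(index, await branch(cancellationToken), null);
        }
        catch (Exception ex)
        {
            // Failures become data so the caller can apply its completion policy.
            return new BranchOutcome<T>(index, default, ex);
        }
        finally
        {
            gate.Release();
        }
    }
}
```

Capturing failures as outcomes rather than letting them fault the combined task is what allows a caller to decide afterwards whether the batch as a whole should throw.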

Test plan

  • Build clean (zero warnings, TreatWarningsAsErrors enforced) on net8.0 and net10.0
  • 31 new unit tests pass alongside the existing 161, for 192 total, including:
    • CompletionConfig matrix (AllSuccessful, AllCompleted, FirstSuccessful, MinSuccessful, ToleratedFailureCount, ToleratedFailurePercentage)
    • Cancel-mid-dispatch regression test (no orphan branches)
    • Concurrent ExecutionState access regression test
    • Replay determinism, mixed-status replay, FirstSuccessful all-fail
  • 6 new integration tests build successfully (require AWS credentials to run)

Generated with Claude Code


COPY bin/publish/ ${LAMBDA_TASK_ROOT}

ENTRYPOINT ["/var/task/bootstrap"]

@GarrettBeatty GarrettBeatty force-pushed the gcbeatty/durable-parallel branch from 19c0128 to fa13eef Compare May 14, 2026 21:49
@GarrettBeatty GarrettBeatty force-pushed the gcbeatty/durable-wave0 branch from 464c591 to d308c3b Compare May 14, 2026 21:49
@GarrettBeatty GarrettBeatty force-pushed the gcbeatty/durable-parallel branch from fa13eef to b7a06b4 Compare May 14, 2026 22:19
GarrettBeatty and others added 11 commits May 15, 2026 18:14

Implements the minimum viable slice of the Amazon.Lambda.DurableExecution
SDK: a workflow can run StepAsync and WaitAsync against a real Lambda,
with replay-aware checkpointing wired through to the AWS service.

Public API surface introduced:
- DurableFunction.WrapAsync — entry point that handles the durable
  execution envelope (input hydration, output construction, status mapping)
- IDurableContext.StepAsync / WaitAsync (4 Step overloads, 1 Wait)
- StepConfig with serializer hook (retry deferred to follow-up PR)
- ICheckpointSerializer interface
- [DurableExecution] attribute (recognized by future source generator)
- DurableExecutionException base + StepException

Internals:
- DurableExecutionHandler — Task.WhenAny race between user code and
  the suspension signal, returning Succeeded/Failed/Pending
- ExecutionState — replay-aware operation lookup and pending checkpoint
  buffer
- OperationIdGenerator — deterministic, replay-stable IDs
- TerminationManager — TaskCompletionSource-based suspension trigger
- LambdaDurableServiceClient — wraps AWSSDK.Lambda's checkpoint and
  state APIs

Tests:
- 86 unit tests covering enums, exceptions, models, configs,
  ID generation, termination, execution state, the handler race,
  the context (Step + Wait paths), and the WrapAsync entry point
- 8 end-to-end integration tests deploying real Lambdas via Docker on
  the provided.al2023 runtime: StepWaitStep, MultipleSteps, WaitOnly,
  LongerWait, ReplayDeterminism, RetrySucceeds, RetryExhausts, StepFails

Out of scope (follow-up PRs):
- IRetryStrategy, ExponentialRetryStrategy, retry decision factories
- DefaultJsonCheckpointSerializer
- DurableLogger replay-suppression (currently returns NullLogger)
- Callbacks, InvokeAsync, ParallelAsync, MapAsync, RunInChildContextAsync,
  WaitForConditionAsync — interface intentionally does not declare them
- Annotations source-generator integration
- DurableTestRunner / Amazon.Lambda.DurableExecution.Testing package
- dotnet new lambda.DurableFunction blueprint

stack-info: PR: #2360, branch: GarrettBeatty/stack/2

remove

update

update

update

update

Match the Python / Java / JavaScript reference SDKs' replay-mode model:
the workflow is "replaying" iff it has not yet revisited every
checkpointed completed user-replayable operation. A single global flag
flipped on the first fresh op (the prior model) misclassified workflow-
body code that runs before the first step and would not generalize to
Map/Parallel/Callback later.

ExecutionState changes:
- Replace `Mode`/`ExecutionMode`/`EnterExecutionMode()` with `IsReplaying`
  + `TrackReplay(operationId)`.
- Initial replay decision: any non-EXECUTION op present means we're
  replaying. The service always sends an EXECUTION-type op carrying the
  input payload — that's bookkeeping, not user history, so it does not
  count toward replay (matches Python execution.py:258, Java
  ExecutionManager:81, JS execution-context.ts:62).
- TrackReplay flips IsReplaying false once every checkpointed terminal-
  status non-EXECUTION op has been visited. Terminal set matches
  Python's: SUCCEEDED, FAILED, CANCELLED, STOPPED.
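The replay rule described in these bullets can be sketched as follows. This is a simplified illustration, not the ExecutionState code: `ReplayTrackerSketch` and the tuple shape used for checkpointed operations are hypothetical, and only the "EXECUTION" type name and the four terminal statuses come from the text above.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public sealed class ReplayTrackerSketch
{
    // Terminal set as listed in the commit message.
    private static readonly HashSet<string> TerminalStatuses =
        new(StringComparer.Ordinal) { "SUCCEEDED", "FAILED", "CANCELLED", "STOPPED" };

    // IDs of terminal-status, non-EXECUTION operations that have not been revisited yet.
    private readonly HashSet<string> _unvisited = new(StringComparer.Ordinal);

    public bool IsReplaying { get; private set; }

    // Hypothetical input shape: one (Id, Type, Status) tuple per checkpointed operation.
    public ReplayTrackerSketch(IEnumerable<(string Id, string Type, string Status)> checkpointed)
    {
        foreach (var op in checkpointed)
        {
            // The EXECUTION op only carries the input payload; it is not user history.
            if (op.Type == "EXECUTION") continue;

            IsReplaying = true; // any non-EXECUTION op means we start in replay
            if (TerminalStatuses.Contains(op.Status))
                _unvisited.Add(op.Id);
        }
    }

    // Called at the top of every operation; safe to call repeatedly with the same ID.
    public void TrackReplay(string operationId)
    {
        if (!IsReplaying) return;

        _unvisited.Remove(operationId);
        if (_unvisited.Count == 0)
            IsReplaying = false; // every completed op has been revisited
    }
}
```

Because only terminal-status operations are tracked, a PENDING operation in the checkpoint never holds the workflow in replay mode, which matches the regression case pinned in the tests.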

Operation changes:
- DurableOperation.ExecuteAsync calls TrackReplay(OperationId) at the
  top, so every operation participates in visit accounting without each
  subclass needing to remember.
- StepOperation/WaitOperation drop their manual EnterExecutionMode calls.

Tests:
- ExecutionStateTests rewritten around IsReplaying/TrackReplay, including
  pinning regressions: only-EXECUTION-op ⇒ NotReplaying, all-visited ⇒
  flips out of replay, PENDING ops do not block transition, idempotency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…Serializer

DurableExecution now reads the registered ILambdaSerializer from the per-invocation
ILambdaContext (added in the prior PR) for both step-result checkpointing and
workflow input/output. AOT-safety is now determined entirely by which serializer
the user registers with LambdaBootstrapBuilder.Create — there is no longer a
forked path between reflection-based and AOT-safe APIs.

Removed:
- ICheckpointSerializer<T> + SerializationContext record
- ReflectionJsonCheckpointSerializer<T>
- The four JsonSerializerContext-taking overloads of DurableFunction.WrapAsync
- The IDurableContext.StepAsync overload that took ICheckpointSerializer<T>
- All [RequiresUnreferencedCode]/[RequiresDynamicCode] attributes and their
  related [UnconditionalSuppressMessage] shims

Net result: 8 WrapAsync overloads → 4, 3 StepAsync overloads → 2, zero trim
attributes in the public API. The AOT smoke test continues to publish with zero
IL2026/IL3050 warnings.

- Wrap LambdaDurableServiceClient SDK calls in DurableExecutionException with
  durable-execution context (which call, which ARN). User logs no longer show
  bare AWSSDK stack traces. Update IsTerminalCheckpointError to unwrap the
  inner AmazonServiceException for classification.
- Move public-API files out of Models/, Config/, Exceptions/ into the project
  root so folder layout matches the Amazon.Lambda.DurableExecution namespace.
- Replace string action literals ("SUCCEED", "FAIL", "START") with the
  Amazon.Lambda.OperationAction enum constants.
- Replace hand-rolled ToHex with Amazon.Util.AWSSDKUtils.ToHex. Drop the
  netstandard2.0 SHA-256 fallback now that DurableExecution targets net8+.
- Spell "iff" as "if and only if" in ExecutionState replay-mode docs.

Tests updated for the new wrapping shape: terminal classification asserts on
DurableExecutionException with the inner SDK exception preserved; transient
and hydration paths assert ThrowsAsync<DurableExecutionException> with
InnerException set to the original AmazonServiceException.

stack-info: PR: #2363, branch: GarrettBeatty/stack/3

Adds child-context support to the .NET Durable Execution SDK. A child
context is a logical sub-workflow with its own deterministic
operation-ID space, persisted as a CONTEXT operation so subsequent
invocations replay the cached value without re-executing the function.

Public surface:
- IDurableContext.RunInChildContextAsync<T> (reflection + AOT-safe
  ICheckpointSerializer<T> overloads, plus a void overload).
- ChildContextConfig with SubType (observability label) and
  ErrorMapping (transform exceptions before they surface to the caller).
- ChildContextException for failure surfacing.

Used as a building block for upcoming WaitForCallbackAsync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Lays down shared types/constants for the upcoming durable-execution
context operations (Callbacks, Invoke, Parallel, Map, WaitForCondition)
and updates the design doc to match decisions reached after comparing
against the Python, JS, and Java reference SDKs.

SDK changes:
- OperationSubTypes constants class (Step, Wait, Callback, WaitForCallback,
  Invoke, WaitForCondition, Parallel, ParallelBranch, Map, MapIteration).
  Replaces hard-coded SubType literals in StepOperation and WaitOperation.
- OperationStatuses.TimedOut for callback/invoke timeout handling.

Design-doc alignment:
- Drop Serializer field from CallbackConfig, InvokeConfig,
  ChildContextConfig. Custom serializers flow through AOT-safe
  ICheckpointSerializer<T> overloads (matches the existing StepConfig
  pattern documented at line 1247).
- InvokeConfig gains TenantId (matches Python/JS/Java); drops
  PayloadSerializer / ResultSerializer.
- BatchItemStatus.Cancelled -> Started. The SDK does not synchronously
  cancel branches; the wire state of items still in flight when the
  batch resolves (e.g., FirstSuccessful short-circuit) is STARTED.
  Matches Python and JS.
- IBatchResult<T> expanded to the full JS/Python surface: adds Started,
  GetErrors(), HasFailure, SuccessCount, FailureCount, StartedCount,
  TotalCount.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@GarrettBeatty GarrettBeatty force-pushed the gcbeatty/durable-wave0 branch from d308c3b to be4c3ad Compare May 18, 2026 15:23

Adds parallel branch execution to the .NET Durable Execution SDK.
ParallelAsync runs N branches concurrently with configurable concurrency
limits and completion policies, returning an IBatchResult<T> with
per-branch status and error information.

Per-branch checkpoint payloads are serialized via the ILambdaSerializer
registered on ILambdaContext.Serializer (typically configured through
LambdaBootstrapBuilder.Create(handler, serializer)), matching the
StepAsync / RunInChildContextAsync pattern. There are no separate
reflection / AOT-safe overload pairs: the AOT story is determined
entirely by which serializer the user registers with the runtime.

Public surface:
- IDurableContext.ParallelAsync<T> (2 overloads: Func[] vs
  DurableBranch<T>[])
- DurableBranch<T> record (Name + Func)
- ParallelConfig (MaxConcurrency, CompletionConfig, NestingType)
- CompletionConfig with factories AllSuccessful() / FirstSuccessful() /
  AllCompleted(); ToleratedFailureCount / ToleratedFailurePercentage
  (validated 0.0-1.0)
- IBatchResult<T> with All / Succeeded / Failed / Started accessors,
  GetResults, GetErrors, ThrowIfError, HasFailure, CompletionReason,
  count properties
- IBatchItem<T> with Index, Name, Status, Result, Error
- BatchItemStatus { Succeeded, Failed, Started }
- CompletionReason { AllCompleted, MinSuccessfulReached,
  FailureToleranceExceeded }
- NestingType (Nested default; Flat throws NotSupportedException - reserved)
- ParallelException (carries IBatchResult; future-subclassable)

Internal:
- ParallelOperation<T> orchestrator dispatches branches with optional
  semaphore-bounded concurrency. Each branch runs as a
  ChildContextOperation<T> with deterministic ID via
  OperationIdGenerator.CreateChild.
- Branch failures aggregated as IBatchItem<T> entries; orchestrator
  throws ParallelException only when CompletionConfig signals
  FailureToleranceExceeded.
- Parent CONTEXT checkpoint records summary (CompletionReason +
  per-branch index/name/status); branch results live on per-branch
  CONTEXT checkpoints.
- ExecutionState is now thread-safe: a lock guards reads and writes of
  _operations, _visitedOperations, and _isReplaying. This is required
  for concurrent branch replay; the change touches all operations but
  introduces no regressions.
- ParallelOperation awaits Task.WhenAll(inFlight) before disposing
  the semaphore, so a cancellation or exception during dispatch still
  lets in-flight branches settle cleanly.
- Reuses OperationSubTypes.Parallel / OperationSubTypes.ParallelBranch
  from Wave 0.

Adds 31 unit tests + 6 integration tests covering CompletionConfig
matrix, MaxConcurrency, FirstSuccessful short-circuit, replay
determinism, mixed-status replay, cancellation, and concurrency
stress.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@GarrettBeatty GarrettBeatty force-pushed the gcbeatty/durable-parallel branch from b7a06b4 to 08b2095 Compare May 18, 2026 15:44
@GarrettBeatty GarrettBeatty force-pushed the gcbeatty/durable-wave0 branch 3 times, most recently from ad4d208 to 3acbed5 Compare May 20, 2026 17:46
Base automatically changed from gcbeatty/durable-wave0 to gcbeatty/durable-child-context May 20, 2026 17:46