.NET: fix: Sporadic failure of Checkpoint Restore with PendingRequests tests#5086

Open
lokitoth wants to merge 2 commits into main from dev/dotnet_workflow/fix_flaky_checkpoint_restore_test

Conversation

@lokitoth
Member

@lokitoth lokitoth commented Apr 3, 2026

Motivation and Context

A recent set of changes introduced or worsened a timing issue around checkpoint restore with pending requests, causing sporadic test failures due to early "halt" of event streaming.

Description

Add a check for pending events to properly sync up epochs.

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • [ ] Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

@lokitoth lokitoth self-assigned this Apr 3, 2026
@lokitoth lokitoth added .NET workflows Related to Workflows in agent-framework labels Apr 3, 2026
Copilot AI review requested due to automatic review settings April 3, 2026 19:41
@lokitoth lokitoth moved this to In Progress in Agent Framework Apr 3, 2026
Contributor

Copilot AI left a comment

Pull request overview

Fixes a timing/epoch synchronization issue during checkpoint restore when there are pending external requests, which was causing sporadic early termination of event streaming and flaky tests in the workflows runtime.

Changes:

  • Adjusts StreamingRunEventStream.TakeEventStreamAsync epoch selection to treat pending requests/run-status as “expecting fresh work”.
  • Adds LockstepRunEventStream.UpdateStatus() and invokes it after checkpoint restore to keep RunStatus aligned with restored pending requests.
  • Removes channel completion on streaming run loop exit (but this introduces a potential hang risk; see PR comment).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| dotnet/src/Microsoft.Agents.AI.Workflows/Execution/StreamingRunEventStream.cs | Updates epoch “fresh work” detection; changes run-loop shutdown behavior. |
| dotnet/src/Microsoft.Agents.AI.Workflows/Execution/LockstepRunEventStream.cs | Adds a status refresh helper to reflect restored pending requests. |
| dotnet/src/Microsoft.Agents.AI.Workflows/Execution/AsyncRunHandle.cs | Calls the new lockstep status refresh after restoring a checkpoint. |

@lokitoth lokitoth force-pushed the dev/dotnet_workflow/fix_flaky_checkpoint_restore_test branch from 4eb9abd to 812550b Compare April 3, 2026 21:36

Labels

.NET workflows Related to Workflows in agent-framework

Projects

Status: In Progress

2 participants