Skip to content

Nexus caller timeouts#2760

Open
bergundy wants to merge 17 commits intotemporalio:masterfrom
bergundy:nexus-caller-timeouts
Open

Nexus caller timeouts#2760
bergundy wants to merge 17 commits intotemporalio:masterfrom
bergundy:nexus-caller-timeouts

Conversation

@bergundy
Copy link
Member

@bergundy bergundy commented Jan 14, 2026

What was changed

  • Added support for ScheduleToStartTimeout and StartToCloseTimeout for Nexus operations in the Java SDK and test server.
  • In the test server:
    • Cap secondary timeouts to scheduleToClose timeout
    • Support all three timeout types in timeoutNexusOperation()
    • Calculate OPERATION_TIMEOUT header dynamically in pollNexusTaskQueue() and use the correct header format
    • Terminate workflows on GRPC_MESSAGE_TOO_LARGE workflow task failures with FORCE_CLOSE_COMMAND cause, matching real server behavior
    • Updated nexus failure wrapping to match real server: nexusFailureToAPIFailure no longer wraps non-temporal failures in ApplicationFailureInfo, and handlerErrorToFailure no longer sets message at the top level
  • Added tests for the new functionality
  • Bumped Temporal Server to 1.31.0 (CLI v1.6.1-server-1.31.0-151.0) and proto submodule to v1.61.0
  • Updated tests for server behavior changes:
    • Activity timeout retry state is now RETRY_STATE_TIMEOUT for schedule-to-start and schedule-to-close timeouts (server fix for temporalio/temporal#3667)
    • Workflows terminated due to gRPC message too large now fail with TerminatedFailure instead of TimeoutFailure
    • Nexus operation error failures no longer include ApplicationFailureInfo wrapper
    • Nexus handler error failure messages are now in the nested cause (not the top-level failure)
    • Skipped tests that require test-server-specific behavior (time skipping, callback URL validation) when running against real server

@bergundy bergundy requested a review from a team as a code owner January 14, 2026 00:57
@Quinn-With-Two-Ns
Copy link
Contributor

Can we add some tests like we have for activity

to verify the parameter is properly wired through and respected

@bergundy
Copy link
Member Author

Can we add some tests like we have for activity

to verify the parameter is properly wired through and respected

We have coverage in NexusWorkflowTest.java.

Assert.assertTrue("OPERATION_TIMEOUT should be positive", operationTimeoutMs > 0);

// Sleep longer than schedule-to-start timeout to trigger the timeout
Thread.sleep(2000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be able to use testWorkflowRule.sleep here to avoid waiting real time in the time skipping test server, but I would prefer the test not use a sleep and just wait for the condition to be true. Should be possible to wait for the operation to fail by looking at the workflow describe output no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the sleep is reasonable here and I don't expect it to be flaky. If you point me to an example of how to use some "wait for condition" or "eventually" in the Java tests, I can adapt that.

@Quinn-With-Two-Ns
Copy link
Contributor

We have coverage in NexusWorkflowTest.java.

NexusWorkflowTest.java doesn't use the Java SDK so it is not providing coverage of the SDK

@Quinn-With-Two-Ns
Copy link
Contributor

Quinn-With-Two-Ns commented Jan 14, 2026

you also need to update constructPendingNexusOperationInfo in the test server I think, unless I missed that.

Copy link
Contributor

@maciejdudko maciejdudko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few comments about timeout calculation in test server but otherwise looks good.

Comment on lines 981 to 982
long elapsedSeconds = Timestamps.between(scheduledTime, currentTime).getSeconds();
long elapsedMillis = elapsedSeconds * 1000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rounds off milliseconds from the difference. At the extreme, a sub-second timeout will be rounded to zero.

Suggested change
long elapsedSeconds = Timestamps.between(scheduledTime, currentTime).getSeconds();
long elapsedMillis = elapsedSeconds * 1000;
long elapsedMillis = com.google.protobuf.util.Durations.toMillis(
Timestamps.between(scheduledTime, currentTime));

// Calculate minimum of all applicable timeouts
Long remainingMillis = null;

if (!isStarted && scheduledEvent.hasScheduleToStartTimeout()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ScheduleToStartTimeout should not be part of the calculation here.

com.google.protobuf.util.Durations.toMillis(
scheduledEvent.getScheduleToCloseTimeout());
if (scheduleToCloseMillis > 0) {
long remaining = scheduleToCloseMillis - elapsedMillis;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code should handle the situation where elapsedMillis is larger than scheduleToCloseMillis for some reason, just for extra safety.

Suggested change
long remaining = scheduleToCloseMillis - elapsedMillis;
long remaining = Math.max(1, scheduleToCloseMillis - elapsedMillis);

Comment on lines 1002 to 1004
long remaining = startToCloseMillis - elapsedMillis;
remainingMillis =
(remainingMillis == null) ? remaining : Math.min(remainingMillis, remaining);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StartToCloseTimeout should not have elapsed time subtracted.

Suggested change
long remaining = startToCloseMillis - elapsedMillis;
remainingMillis =
(remainingMillis == null) ? remaining : Math.min(remainingMillis, remaining);
remainingMillis =
(remainingMillis == null) ? startToCloseMillis : Math.min(remainingMillis, startToCloseMillis);

Copy link
Contributor

@maciejdudko maciejdudko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two tests failing.

io.temporal.workflow.nexus.HeaderTest#testOperationHeaders (test server only): the test expects both request-timeout and operation-timeout headers to be always set. Currently, test server doesn't send operation-timeout if it's not set in options. The real server seems to have a default schedule-to-close timeout of 200 seconds, though I'm not sure where it comes from. The test server should match the real server and have a default timeout too.

io.temporal.testserver.functional.NexusWorkflowTest#testNexusOperationStartToCloseTimeout (real server only): there are two problems. One is that the real server sets error message as "operation timed out" but the test expects "operation timed out after starting". I think we should have a contains-substring check here rather than equality. The second I was unable to reproduce locally but it failed 2/2 times in CI - "OPERATION_TIMEOUT should be <= start-to-close timeout" assertion failed. Might be because of JDK8, or something else.

@bergundy bergundy force-pushed the nexus-caller-timeouts branch 3 times, most recently from fa5185d to 1bcbc59 Compare February 4, 2026 22:40
@bergundy
Copy link
Member Author

bergundy commented Feb 5, 2026

Finally getting back to this.

There are two tests failing.

io.temporal.workflow.nexus.HeaderTest#testOperationHeaders (test server only): the test expects both request-timeout and operation-timeout headers to be always set. Currently, test server doesn't send operation-timeout if it's not set in options. The real server seems to have a default schedule-to-close timeout of 200 seconds, though I'm not sure where it comes from. The test server should match the real server and have a default timeout too.

Yeah, I noticed this, this PR fixes the bug in the Java test server where the operation-timeout would be set when it shouldn't have but looks like it's still not quite mimicking the real server. The "real" server defaults schedule to close timeout to the workflow run timeout, we need to fix this in the Java test server too.

io.temporal.testserver.functional.NexusWorkflowTest#testNexusOperationStartToCloseTimeout (real server only): there are two problems. One is that the real server sets error message as "operation timed out" but the test expects "operation timed out after starting". I think we should have a contains-substring check here rather than equality. The second I was unable to reproduce locally but it failed 2/2 times in CI - "OPERATION_TIMEOUT should be <= start-to-close timeout" assertion failed. Might be because of JDK8, or something else.

I'm changing the error messages everywhere to "operation timed out". We need to upgrade the CLI to include a server that supports the new timeouts before merging this.

bergundy and others added 11 commits February 18, 2026 10:59
The test server now caps the scheduleToCloseTimeout for Nexus operations
to the workflow run timeout, matching the behavior of the real Temporal
server. This ensures the operation-timeout header is properly set in tests.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The ABANDON cancel-before-sent sub-test can start a handler workflow that
sets the shared opStarted/handlerFinished signals, causing the subsequent
after-start sub-test to proceed with stale signal state and fail. Clear
both signals between the two sub-tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bergundy bergundy force-pushed the nexus-caller-timeouts branch from 0db9f72 to fbbb14a Compare February 18, 2026 18:59
The previous commit (c2d4bcf "Use tagged protos") moved the proto submodule
to 9daa310 on release/v1.60.x which doesn't include the schedule_to_start_timeout
and start_to_close_timeout fields in ScheduleNexusOperationCommandAttributes.
This updates it to v1.61.0 (831177c) which includes commit d2c3e7c
"Add nexus caller timeouts (temporalio#695)".
Activity timeout tests now expect RETRY_STATE_TIMEOUT unconditionally,
matching the server fix for temporalio/temporal#3667. GrpcMessageTooLarge
tests expect TerminatedFailure with FORCE_CLOSE_COMMAND cause instead of
TimeoutFailure. Updated test server to terminate workflows on
GRPC_MESSAGE_TOO_LARGE failures, aligning with real server behavior.
@bergundy bergundy force-pushed the nexus-caller-timeouts branch from 333a15e to f568ffd Compare February 18, 2026 21:10
… 1.31.0

- Update nexusFailureToAPIFailure to not wrap non-temporal failures in
  ApplicationFailureInfo (matching real server behavior)
- Update handlerErrorToFailure to not set message at top level (message
  is only in the nested cause on real server)
- Update NexusWorkflowTest assertions to check cause.getCause() for
  handler error messages
- Remove applicationFailureInfo assertion from testNexusOperationError
- Skip testNexusOperationTimeout_AfterCancel with real server (timeout
  behavior after cancel differs)
- Skip testNexusOperationScheduleToStartTimeout with real server
  (requires time skipping)
- Skip WorkflowIdConflictPolicyTest.conflictPolicyUseExisting with real
  server (callback URL validation rejects test URLs)
…r server 1.31.0"

This reverts commit d2a6ac4.

Claude mistakenly commited this.
--dynamic-config-value frontend.workerVersioningDataAPIs=true \
--dynamic-config-value component.nexusoperations.recordCancelRequestCompletionEvents=true \
--dynamic-config-value component.callbacks.allowedAddresses='[{"Pattern":"*","AllowInsecure":true}]' \
--dynamic-config-value history.MaxBufferedQueryCount=100000 \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we adding a bunch of (unrelated?) dynamic configs? Is this a merge conflict with us cleaning them up or are they actually needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a merge conflict that wasn't properly resolved. Thanks.


@Test
public void testNexusOperationScheduleToStartTimeout() {
assumeTrue(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we skipping for the real server?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a mistake. Let me remove this.

attributes.setScheduleToCloseTimeout(
ProtobufTimeUtils.toProtoDuration(input.getOptions().getScheduleToCloseTimeout()));
attributes.setScheduleToStartTimeout(
ProtobufTimeUtils.toProtoDuration(input.getOptions().getScheduleToStartTimeout()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: ProtobufTimeUtils.toProtoDuration will return a zero duration here not a null duration, sometimes the server treats a zero duration different then a empty duration so we need to be careful that is not the case here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Server treats zero and null the same here.

Assert.assertEquals("nexus operation completed unsuccessfully", failure.getMessage());
io.temporal.api.failure.v1.Failure cause = failure.getCause();
Assert.assertEquals("deliberate test failure", cause.getMessage());
Assert.assertTrue(cause.hasApplicationFailureInfo());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What type of failure info does the cause have now?


@Test
public void conflictPolicyUseExisting() {
org.junit.Assume.assumeTrue(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also shouldn't have been committed.

- Update test server nexus failure wrapping to match real server behavior:
  remove ApplicationFailureInfo wrapping for non-temporal failures and
  remove redundant message from handler error top-level failure
- Fix NexusWorkflowTest assertions for updated failure structure
- Fix testNexusOperationTimeout_AfterCancel to handle intermediate WFTs
  caused by NEXUS_OPERATION_CANCEL_REQUEST_FAILED events
- Fix testNexusOperationScheduleToStartTimeout timing for real server
- Use unique workflow IDs in WorkflowIdConflictPolicyTest for real server
- Add component.callbacks.allowedAddresses config to CI server startup
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants

Comments