Conversation
|
Can we add some tests like we have for activity to verify the parameter is properly wired through and respected |
We have coverage in NexusWorkflowTest.java. |
| Assert.assertTrue("OPERATION_TIMEOUT should be positive", operationTimeoutMs > 0); | ||
|
|
||
| // Sleep longer than schedule-to-start timeout to trigger the timeout | ||
| Thread.sleep(2000); |
There was a problem hiding this comment.
Should be able to use testWorkflowRule.sleep here to avoid waiting real time in the time skipping test server, but I would prefer the test not use a sleep and just wait for the condition to be true. Should be possible to wait for the operation to fail by looking at the workflow describe output no?
There was a problem hiding this comment.
I think the sleep is reasonable here and I don't expect it to be flaky. If you point me to an example of how to use some "wait for condition" or "eventually" in the Java tests, I can adapt that.
NexusWorkflowTest.java doesn't use the Java SDK so it is not providing coverage of the SDK |
|
you also need to update |
maciejdudko
left a comment
There was a problem hiding this comment.
I left a few comments about timeout calculation in test server but otherwise looks good.
| long elapsedSeconds = Timestamps.between(scheduledTime, currentTime).getSeconds(); | ||
| long elapsedMillis = elapsedSeconds * 1000; |
There was a problem hiding this comment.
This rounds off milliseconds from the difference. At the extreme, a sub-second timeout will be rounded to zero.
| long elapsedSeconds = Timestamps.between(scheduledTime, currentTime).getSeconds(); | |
| long elapsedMillis = elapsedSeconds * 1000; | |
| long elapsedMillis = com.google.protobuf.util.Durations.toMillis( | |
| Timestamps.between(scheduledTime, currentTime)); |
| // Calculate minimum of all applicable timeouts | ||
| Long remainingMillis = null; | ||
|
|
||
| if (!isStarted && scheduledEvent.hasScheduleToStartTimeout()) { |
There was a problem hiding this comment.
ScheduleToStartTimeout should not be part of the calculation here.
| com.google.protobuf.util.Durations.toMillis( | ||
| scheduledEvent.getScheduleToCloseTimeout()); | ||
| if (scheduleToCloseMillis > 0) { | ||
| long remaining = scheduleToCloseMillis - elapsedMillis; |
There was a problem hiding this comment.
This code should handle the situation where elapsedMillis is larger than scheduleToCloseMillis for some reason, just for extra safety.
| long remaining = scheduleToCloseMillis - elapsedMillis; | |
| long remaining = Math.max(1, scheduleToCloseMillis - elapsedMillis); |
| long remaining = startToCloseMillis - elapsedMillis; | ||
| remainingMillis = | ||
| (remainingMillis == null) ? remaining : Math.min(remainingMillis, remaining); |
There was a problem hiding this comment.
StartToCloseTimeout should not have elapsed time subtracted.
| long remaining = startToCloseMillis - elapsedMillis; | |
| remainingMillis = | |
| (remainingMillis == null) ? remaining : Math.min(remainingMillis, remaining); | |
| remainingMillis = | |
| (remainingMillis == null) ? startToCloseMillis : Math.min(remainingMillis, startToCloseMillis); |
maciejdudko
left a comment
There was a problem hiding this comment.
There are two tests failing.
io.temporal.workflow.nexus.HeaderTest#testOperationHeaders (test server only): the test expects both request-timeout and operation-timeout headers to be always set. Currently, test server doesn't send operation-timeout if it's not set in options. The real server seems to have a default schedule-to-close timeout of 200 seconds, though I'm not sure where it comes from. The test server should match the real server and have a default timeout too.
io.temporal.testserver.functional.NexusWorkflowTest#testNexusOperationStartToCloseTimeout (real server only): there are two problems. One is that the real server sets error message as "operation timed out" but the test expects "operation timed out after starting". I think we should have a contains-substring check here rather than equality. The second I was unable to reproduce locally but it failed 2/2 times in CI - "OPERATION_TIMEOUT should be <= start-to-close timeout" assertion failed. Might be because of JDK8, or something else.
fa5185d to
1bcbc59
Compare
|
Finally getting back to this.
Yeah, I noticed this, this PR fixes the bug in the Java test server where the operation-timeout would be set when it shouldn't have but looks like it's still not quite mimicking the real server. The "real" server defaults schedule to close timeout to the workflow run timeout, we need to fix this in the Java test server too.
I'm changing the error messages everywhere to |
The test server now caps the scheduleToCloseTimeout for Nexus operations to the workflow run timeout, matching the behavior of the real Temporal server. This ensures the operation-timeout header is properly set in tests. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The ABANDON cancel-before-sent sub-test can start a handler workflow that sets the shared opStarted/handlerFinished signals, causing the subsequent after-start sub-test to proceed with stale signal state and fail. Clear both signals between the two sub-tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
0db9f72 to
fbbb14a
Compare
The previous commit (c2d4bcf "Use tagged protos") moved the proto submodule to 9daa310 on release/v1.60.x which doesn't include the schedule_to_start_timeout and start_to_close_timeout fields in ScheduleNexusOperationCommandAttributes. This updates it to v1.61.0 (831177c) which includes commit d2c3e7c "Add nexus caller timeouts (temporalio#695)".
Activity timeout tests now expect RETRY_STATE_TIMEOUT unconditionally, matching the server fix for temporalio/temporal#3667. GrpcMessageTooLarge tests expect TerminatedFailure with FORCE_CLOSE_COMMAND cause instead of TimeoutFailure. Updated test server to terminate workflows on GRPC_MESSAGE_TOO_LARGE failures, aligning with real server behavior.
333a15e to
f568ffd
Compare
… 1.31.0 - Update nexusFailureToAPIFailure to not wrap non-temporal failures in ApplicationFailureInfo (matching real server behavior) - Update handlerErrorToFailure to not set message at top level (message is only in the nested cause on real server) - Update NexusWorkflowTest assertions to check cause.getCause() for handler error messages - Remove applicationFailureInfo assertion from testNexusOperationError - Skip testNexusOperationTimeout_AfterCancel with real server (timeout behavior after cancel differs) - Skip testNexusOperationScheduleToStartTimeout with real server (requires time skipping) - Skip WorkflowIdConflictPolicyTest.conflictPolicyUseExisting with real server (callback URL validation rejects test URLs)
…r server 1.31.0" This reverts commit d2a6ac4. Claude mistakenly commited this.
.github/workflows/ci.yml
Outdated
| --dynamic-config-value frontend.workerVersioningDataAPIs=true \ | ||
| --dynamic-config-value component.nexusoperations.recordCancelRequestCompletionEvents=true \ | ||
| --dynamic-config-value component.callbacks.allowedAddresses='[{"Pattern":"*","AllowInsecure":true}]' \ | ||
| --dynamic-config-value history.MaxBufferedQueryCount=100000 \ |
There was a problem hiding this comment.
Why are we adding a bunch of (unrelated?) dynamic configs? Is this a merge conflict with us cleaning them up or are they actually needed?
There was a problem hiding this comment.
Looks like a merge conflict that wasn't properly resolved. Thanks.
|
|
||
| @Test | ||
| public void testNexusOperationScheduleToStartTimeout() { | ||
| assumeTrue( |
There was a problem hiding this comment.
Why are we skipping for the real server?
There was a problem hiding this comment.
This is a mistake. Let me remove this.
| attributes.setScheduleToCloseTimeout( | ||
| ProtobufTimeUtils.toProtoDuration(input.getOptions().getScheduleToCloseTimeout())); | ||
| attributes.setScheduleToStartTimeout( | ||
| ProtobufTimeUtils.toProtoDuration(input.getOptions().getScheduleToStartTimeout())); |
There was a problem hiding this comment.
note: ProtobufTimeUtils.toProtoDuration will return a zero duration here not a null duration, sometimes the server treats a zero duration different then a empty duration so we need to be careful that is not the case here.
There was a problem hiding this comment.
Server treats zero and null the same here.
| Assert.assertEquals("nexus operation completed unsuccessfully", failure.getMessage()); | ||
| io.temporal.api.failure.v1.Failure cause = failure.getCause(); | ||
| Assert.assertEquals("deliberate test failure", cause.getMessage()); | ||
| Assert.assertTrue(cause.hasApplicationFailureInfo()); |
There was a problem hiding this comment.
What type of failure info does the cause have now?
|
|
||
| @Test | ||
| public void conflictPolicyUseExisting() { | ||
| org.junit.Assume.assumeTrue( |
There was a problem hiding this comment.
This also shouldn't have been committed.
- Update test server nexus failure wrapping to match real server behavior: remove ApplicationFailureInfo wrapping for non-temporal failures and remove redundant message from handler error top-level failure - Fix NexusWorkflowTest assertions for updated failure structure - Fix testNexusOperationTimeout_AfterCancel to handle intermediate WFTs caused by NEXUS_OPERATION_CANCEL_REQUEST_FAILED events - Fix testNexusOperationScheduleToStartTimeout timing for real server - Use unique workflow IDs in WorkflowIdConflictPolicyTest for real server - Add component.callbacks.allowedAddresses config to CI server startup
What was changed
ScheduleToStartTimeoutandStartToCloseTimeoutfor Nexus operations in the Java SDK and test server.timeoutNexusOperation()pollNexusTaskQueue()and use the correct header formatGRPC_MESSAGE_TOO_LARGEworkflow task failures withFORCE_CLOSE_COMMANDcause, matching real server behaviornexusFailureToAPIFailureno longer wraps non-temporal failures inApplicationFailureInfo, andhandlerErrorToFailureno longer sets message at the top levelRETRY_STATE_TIMEOUTfor schedule-to-start and schedule-to-close timeouts (server fix for temporalio/temporal#3667)TerminatedFailureinstead ofTimeoutFailureApplicationFailureInfowrapper