Improve docstring of Elastic dataclass in flytekit-kf-pytorch #3419
Open
dmholtz wants to merge 3 commits into flyteorg:master
Conversation
This PR highlights additional settings for multi-node training on K8s through a remark in the Elastic docstring, helping users avoid TCPStore timeouts if the zero worker starts slower than any other worker.

Signed-off-by: David Holtz <56723830+dmholtz@users.noreply.github.com>
kumare3 approved these changes on Apr 8, 2026

fg91 reviewed on Apr 8, 2026
> be assigned to a running node which might have the image in its cache while other workers might require a node scale up and image pull.
> When using the default `torch.distributed.elastic.rendezvous.c10d_rendezvous_backend.C10dRendezvousBackend`, consider also increasing
> the TCPStore `read_timeout`, e.g., `{"timeout": 900, "join_timeout": 900, "read_timeout": 900}`, as its default value of 60 seconds
> might be too tight if the zero-worker starts slower than any other worker.
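For illustration, a minimal sketch of how a user would pass these settings, assuming the `rdzv_configs` field of the plugin's `Elastic` config (field names taken from flytekitplugins-kfpytorch; the node and process counts below are illustrative):

```python
from flytekit import task
from flytekitplugins.kfpytorch import Elastic


@task(
    task_config=Elastic(
        nnodes=2,  # illustrative: two-node elastic training
        nproc_per_node=4,  # illustrative: four processes per node
        # Raise the rendezvous timeouts so a slow-starting worker-0
        # (e.g. waiting on a node scale-up and image pull) does not
        # trip the C10d backend's 60 s default read_timeout.
        rdzv_configs={"timeout": 900, "join_timeout": 900, "read_timeout": 900},
    )
)
def train() -> None:
    ...
```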
Member
Nit:
This is mostly relevant when not using gang scheduling; we could mention this here.
Author
@fg91 Thanks for the hint; I added an additional remark about its relevance.
Member
With "zero-worker" you mean rank 0, right?
Author
With "zero-worker", I mean the pod with the name {flyte-task-id}-worker-0. Not the rank 0 process of torch DDP.
Signed-off-by: David Holtz <56723830+dmholtz@users.noreply.github.com>
dmholtz force-pushed from f458f49 to 611c3f7
fg91 reviewed on Apr 9, 2026

fg91 approved these changes on Apr 9, 2026
Member
fg91 left a comment
Two nits but LG apart from that, thanks!
Co-authored-by: Fabio M. Graetz, Ph.D. <fabiograetz@googlemail.com>
Signed-off-by: David Holtz <56723830+dmholtz@users.noreply.github.com>
dmholtz force-pushed from d92acc6 to 6bb5ca1
This PR highlights additional settings for configuring multi-node training with flytekit-kf-pytorch.
Why are the changes needed?
The default `read_timeout` of the default `C10dRendezvousBackend` in multi-node training with flytekit-kf-pytorch is 60 seconds, which might be too tight if the zero worker starts slower than any other worker. To avoid users of flytekit-kf-pytorch being confused by obscure timeout errors during startup of such elastic PyTorch tasks, we add additional configuration hints that avoid these errors.
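As an aside, a minimal sketch of the store whose timeout is being tuned here, assuming direct use of `torch.distributed.TCPStore` (the plugin sets this indirectly through the rendezvous config; the host and port below are illustrative):

```python
from datetime import timedelta

from torch.distributed import TCPStore

# The C10d rendezvous backend keeps its coordination state in a TCPStore
# hosted on the first worker. Client reads against the store raise once the
# timeout elapses, so a slow-starting host surfaces as a timeout error.
store = TCPStore(
    host_name="localhost",  # illustrative; normally the worker-0 address
    port=29500,             # illustrative rendezvous port
    is_master=True,
    timeout=timedelta(seconds=900),
)
store.set("status", "ready")
print(store.get("status"))  # b'ready'
```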
What changes were proposed in this pull request?
This PR adds a remark to explicitly increase the `read_timeout` of the TCPStore used by the `C10dRendezvousBackend`, which is the default for multi-node training with flytekit-kf-pytorch. I decided not to change any defaults in the `rdzv_config` dictionary to remain agnostic of the chosen backend.

How was this patch tested?
This PR touches only documentation, so no tests are required.