Improve docstring of Elastic dataclass in flytekit-kf-pytorch #3419

Open
dmholtz wants to merge 3 commits into flyteorg:master from dmholtz:dmholtz/improve-elastic-docstring


dmholtz commented Apr 8, 2026

This PR highlights additional settings for configuring multi-node training with flytekit-kf-pytorch.

Why are the changes needed?

The default C10dRendezvousBackend used for multi-node training with flytekit-kf-pytorch has a default read_timeout of 60 seconds, which might be too tight if the zero worker starts more slowly than the other workers.
To spare users of flytekit-kf-pytorch obscure timeout errors during startup of such elastic PyTorch tasks, we add configuration hints that avoid these errors.

What changes were proposed in this pull request?

This PR adds a remark recommending that users explicitly increase the read_timeout of the TCPStore used by the C10dRendezvousBackend, which is the default for multi-node training with flytekit-kf-pytorch.

I decided not to change any defaults in the rdzv_config dictionary, so that the plugin remains agnostic of the chosen backend.
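
For illustration, a minimal sketch of how a user might apply the recommended setting. The helper `with_read_timeout` and the constant `BASE_RDZV_CONFIGS` are hypothetical names introduced here, not part of flytekit. The values are taken from the docstring example in this PR, and the commented usage assumes the `Elastic` field is named `rdzv_configs`.

```python
from typing import Any, Dict

# Values mirrored from the docstring example in this PR; treat them as
# illustrative rather than as a guaranteed flytekit default.
BASE_RDZV_CONFIGS: Dict[str, Any] = {"timeout": 900, "join_timeout": 900}


def with_read_timeout(rdzv_configs: Dict[str, Any], read_timeout: int = 900) -> Dict[str, Any]:
    """Return a copy of rdzv_configs with the TCPStore read_timeout (seconds) set."""
    if read_timeout <= 0:
        raise ValueError("read_timeout must be a positive number of seconds")
    merged = dict(rdzv_configs)  # copy so the caller's dict is not mutated
    merged["read_timeout"] = read_timeout
    return merged


rdzv_configs = with_read_timeout(BASE_RDZV_CONFIGS)
print(rdzv_configs)

# Hypothetical usage (requires flytekit and flytekitplugins-kfpytorch;
# the field name rdzv_configs is an assumption):
#
# from flytekit import task
# from flytekitplugins.kfpytorch import Elastic
#
# @task(task_config=Elastic(nnodes=2, nproc_per_node=4, rdzv_configs=rdzv_configs))
# def train() -> None:
#     ...
```

Keeping the override in user code rather than in the plugin defaults matches the choice above: `read_timeout` is specific to the c10d backend, so other backends are unaffected unless the user opts in.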

How was this patch tested?

This PR touches only documentation, so no tests are required.

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

This PR highlights additional settings for multi-node trainings on K8s through a remark in the Elastic docstring to avoid TCPStore timeouts if the zero worker starts slower than any other worker.

Signed-off-by: David Holtz <56723830+dmholtz@users.noreply.github.com>
@dmholtz dmholtz marked this pull request as ready for review April 8, 2026 09:01
be assigned to a running node which might have the image in its cache while other workers might require a node scale up and image pull.
When using the default `torch.distributed.elastic.rendezvous.c10d_rendezvous_backend.C10dRendezvousBackend`, consider also increasing
the TCPStore `read_timeout`, e.g., {"timeout": 900, "join_timeout": 900, "read_timeout": 900}, as its default value of 60 seconds
might be too tight if the zero-worker starts slower than any other worker.
Member

Nit:
This is mostly relevant when not using gang scheduling, we could mention this here.

Author

@fg91 Thanks for this hint, I added an additional remark about its relevancy.

Member

With "zero-worker" you mean rank 0, right?

Author

With "zero-worker", I mean the pod with the name {flyte-task-id}-worker-0. Not the rank 0 process of torch DDP.

Signed-off-by: David Holtz <56723830+dmholtz@users.noreply.github.com>
@dmholtz dmholtz force-pushed the dmholtz/improve-elastic-docstring branch from f458f49 to 611c3f7 Compare April 9, 2026 19:12
Comment thread plugins/flytekit-kf-pytorch/flytekitplugins/kfpytorch/task.py Outdated
Member

@fg91 fg91 left a comment

Two nits but LG apart from that, thanks!

Co-authored-by: Fabio M. Graetz, Ph.D. <fabiograetz@googlemail.com>
Signed-off-by: David Holtz <56723830+dmholtz@users.noreply.github.com>
@dmholtz dmholtz force-pushed the dmholtz/improve-elastic-docstring branch from d92acc6 to 6bb5ca1 Compare April 10, 2026 12:58