Add automatic retry with backoff for transient failures (timeouts, 5xx) #597

@dhimmel

AI-written, but based on an error we observed in production logs and lightly edited by me.

The WorkOS Python SDK has no retry logic for transient failures. When a WorkOS API call encounters a timeout or transient server error, it fails immediately on the first attempt. This is especially problematic for operations like authenticate_with_refresh_token, which are inherently idempotent and safe to retry.

We hit this in production today. Our call to authenticate_with_refresh_token connected to api.workos.com successfully, but WorkOS never sent response headers. After 25 seconds (the SDK's DEFAULT_REQUEST_TIMEOUT), httpx.ReadTimeout was raised and our user's auth flow was broken.

A single automatic retry would have resolved this transparently.

Current behavior

AsyncHTTPClient.request() in workos/utils/http_client.py makes a single request with no retry:

response = await self._client.request(**prepared_request_parameters)

Transient failures — httpx.TimeoutException, httpx.ConnectError, HTTP 429, HTTP 5xx — all fail immediately.

Additionally, these transport-level exceptions are not caught or wrapped in WorkOS exception types, so they bubble up as raw httpx errors. Consumers catching workos.exceptions.BaseRequestException won't catch timeouts.
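
One way the wrapping could look, as a minimal sketch rather than the SDK's actual code: `SDKRequestError` and `SDKTransportError` are hypothetical stand-ins for `workos.exceptions.BaseRequestException` and a new subclass of it, and the built-in `TimeoutError` and `ConnectionError` stand in for the httpx exceptions so the snippet runs without dependencies.

```python
import asyncio


class SDKRequestError(Exception):
    """Hypothetical stand-in for workos.exceptions.BaseRequestException."""


class SDKTransportError(SDKRequestError):
    """Hypothetical subclass for failures where no HTTP response arrived."""


async def send_wrapping_transport_errors(
    send,
    transport_exceptions=(TimeoutError, ConnectionError),
):
    """Await send() and re-raise transport failures as SDKTransportError.

    The built-in TimeoutError and ConnectionError are stand-ins here; an SDK
    built on httpx would more likely pass httpx.TimeoutException and
    httpx.ConnectError (or their shared base class, httpx.TransportError).
    """
    try:
        return await send()
    except transport_exceptions as exc:
        # Chain the original error so the transport-level detail stays available.
        raise SDKTransportError(str(exc)) from exc
```

With something like this in place, a consumer's `except` clause on the SDK's base exception would also catch timeouts, and the original error would still be reachable through `__cause__`.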

Expected behavior

Comparison to other auth/identity SDKs:

| SDK | Default retries | Retryable conditions |
| --- | --- | --- |
| Auth0 Python | 2 | 408, 429, 5xx, connection errors |
| AWS SDKs | 3 | 429, 5xx, connection errors |
| WorkOS Python | 0 | None |

A reasonable default would be 2-3 retries with exponential backoff for:

  • httpx.TimeoutException and httpx.ConnectError
  • HTTP 429 (respecting Retry-After header)
  • HTTP 500, 502, 503, 504
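
The conditions in this list could be expressed with a few small helpers. This is only a sketch: the function names, the 0.5 second base delay, and the 8 second cap are assumptions for illustration, and `retry_after_seconds` handles only the delta-seconds form of Retry-After, not the HTTP-date form.

```python
import random

# Status codes from the list above that are treated as transient.
RETRYABLE_STATUS_CODES = frozenset({429, 500, 502, 503, 504})


def is_retryable_status(status_code):
    """Return True if a response with this status is worth retrying."""
    return status_code in RETRYABLE_STATUS_CODES


def backoff_delay(attempt, base=0.5, cap=8.0, jitter=0.0):
    """Exponential backoff: base * 2**attempt, capped, plus optional jitter.

    attempt is 0 for the first retry. jitter is a fraction of the delay
    (0.0 disables it, which keeps the schedule deterministic).
    """
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, jitter * delay)


def retry_after_seconds(header_value):
    """Parse a Retry-After header given as whole seconds.

    Returns None when the header is missing or not in delta-seconds form.
    """
    if header_value is None:
        return None
    value = header_value.strip()
    return float(value) if value.isdigit() else None
```

For a 429, a caller could sleep for `retry_after_seconds(...)` when it returns a value and fall back to `backoff_delay(attempt)` otherwise.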

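This could be implemented as a simple retry loop in request(). httpx's transport-level retries option (httpx.AsyncHTTPTransport(retries=...)) could cover part of it, but it only retries connection failures (httpx.ConnectError and httpx.ConnectTimeout), not read timeouts or HTTP status codes.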
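
A retry loop in request() could look roughly like the sketch below. It is not the SDK's code: `request_with_retries`, its parameters, and the injected `send` callable are hypothetical, and the built-in `TimeoutError` and `ConnectionError` stand in for `httpx.TimeoutException` and `httpx.ConnectError` so the example runs without httpx installed.

```python
import asyncio

RETRYABLE_STATUS_CODES = frozenset({429, 500, 502, 503, 504})


async def request_with_retries(
    send,
    retryable_exceptions=(TimeoutError, ConnectionError),
    max_retries=2,
    base_delay=0.5,
    sleep=asyncio.sleep,
):
    """Call send() up to max_retries + 1 times with exponential backoff.

    send is a zero-argument callable returning an awaitable that resolves to
    an object with a status_code attribute. In the SDK this would wrap
    self._client.request(**prepared_request_parameters).
    """
    for attempt in range(max_retries + 1):
        is_last = attempt == max_retries
        try:
            response = await send()
        except retryable_exceptions:
            if is_last:
                raise  # Out of retries: surface the transport error.
        else:
            if response.status_code not in RETRYABLE_STATUS_CODES or is_last:
                return response  # Success, non-retryable status, or final try.
        # Wait 0.5s, 1s, 2s, ... before the next attempt.
        await sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable: the last attempt returns or raises")
```

In the production incident described above, the first attempt would have raised a timeout and the second attempt would have gone out after a 0.5 second pause. A fuller version would also honor Retry-After on 429 responses and add jitter to the delay.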

Workaround

We currently wrap SDK calls in our own try/except for httpx.ReadTimeout, but this requires knowing about httpx internals. The SDK's exception hierarchy should abstract transport errors away.

Environment

  • workos v5.45.0
  • httpx v0.28.1
  • Python 3.14
  • Async client (AsyncHTTPClient)
