Skip to content

Conversation

@arjav1528
Copy link
Contributor

Closes: #60261

Summary

This PR fixes the intermittent 405 Method Not Allowed error that edge workers encounter during startup in high-concurrency environments (e.g., 8 API servers × 64 worker processes).

Root Cause

The edge executor plugin was acquiring a global database lock (DBLocks.MIGRATIONS) at class definition time (plugin import) to create tables. When many API server processes started simultaneously, they all competed for the same lock, causing:

  1. psycopg2.errors.LockNotAvailable errors on lock timeout
  2. Plugin import failures → edge worker endpoints not registered
  3. Edge workers receiving 405 Method Not Allowed for /edge_worker/v1/worker/

Solution

  • Moved table creation from plugin import time to a FastAPI startup event
  • Plugin import no longer blocks on database operations
  • Added retry mechanism (5 attempts) for lock acquisition
  • Exponential backoff with jitter to avoid thundering herd
  • Proper logging for debugging
  • Check if tables exist before attempting to acquire lock
  • Skip lock entirely during normal operation (tables already exist)
  • Double-check pattern after lock acquisition to handle race conditions

@boring-cyborg boring-cyborg bot added area:dev-tools area:providers backport-to-v3-1-test Mark PR with this label to backport to v3-1-test branch provider:edge Edge Executor / Worker (AIP-69) / edge3 labels Jan 15, 2026
Copy link
Contributor

@jscheffl jscheffl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for investing efforts in a solution to prevent this.

Unfortunately in my view this adds a lot of complexity for the safety whereaswe have a clean solution for the problem. Providers can define DB migrations with the same tooling like the airflow core uses - alembic. The Fab provider uses this.

In edge we never implemented it like this because at time of starting that migration facility was not existing. Edge started with Airflow 2.10 but the migration tooling arrived only in Airflow 3.

Now as edge has dropped support for Airflow 2.x we can assume migration tooling is available from the core to leverage migrations in edge provider as well.

So long story short... instead of adding complexity and retries in loading the UI plugin I'd ppropose to implement a provider specific alembic migration package that is execute once on startup. The migration implementation in the plugin was always a workround. We should not further add complexity to the workaround.

See also notes in https://github.com/apache/airflow/blob/main/providers/edge3/src/airflow/providers/edge3/executors/edge_executor.py#L64

https://github.com/apache/airflow/blob/main/contributing-docs/14_metadata_database_updates.rst#how-to-hook-your-application-into-airflows-migration-process

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are unrelated. Might be a good general improvement for typing... but please separate this to another PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:dev-tools area:providers backport-to-v3-1-test Mark PR with this label to backport to v3-1-test branch provider:edge Edge Executor / Worker (AIP-69) / edge3

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants