Add direct GCS export to DatabricksSqlOperator with Parquet/Avro support #55128

@pauldouane

Description

Add a feature to the DatabricksSqlOperator that allows direct export of query results to a Google Cloud Storage (GCS) bucket in Parquet and Avro formats.

Use case/motivation

The current DatabricksSqlOperator can execute SQL queries and save the results to a file via the output_path and output_format parameters (a usage sketch follows the list below). However, it has a few limitations for common data engineering workflows:

It only supports csv, json, and jsonl formats. Parquet and Avro are widely used for their performance and schema handling.

The output is saved to the Databricks cluster's local filesystem (/tmp/), not directly to object storage such as GCS. This requires an additional step (a separate Airflow task, or logic inside Databricks) to move the file to its final destination, adding unnecessary complexity to the DAG.
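
For reference, a minimal sketch of how the operator is used today; the warehouse name and query are placeholders, the result lands in a local file, and only csv/json/jsonl are accepted:

```python
from airflow.providers.databricks.operators.databricks_sql import DatabricksSqlOperator

# Current behaviour: results are written to a local file, and only
# csv, json and jsonl are valid output_format values.
export_to_local = DatabricksSqlOperator(
    task_id="export_daily_orders",
    databricks_conn_id="databricks_default",
    sql_endpoint_name="my_sql_warehouse",    # assumed warehouse name
    sql="SELECT * FROM sales.daily_orders",  # example query
    output_path="/tmp/daily_orders.csv",
    output_format="csv",
)
```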

This feature would streamline simple ETL/ELT pipelines by allowing a single Airflow task to:

Execute a SQL query on a Databricks warehouse.

Export the result as a Parquet or Avro file.

Save the file directly to a specified GCS path.

This would eliminate the need for an intermediate COPY command or a separate Databricks job (DatabricksSubmitRunOperator) for simple export scenarios, simplifying DAGs and improving readability. A sketch of the intended task definition follows.
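
A hypothetical task definition once the feature exists; the bucket, path, warehouse name, and query are placeholders, and the parquet/gs:// values are the proposed additions, not current operator behaviour:

```python
from airflow.providers.databricks.operators.databricks_sql import DatabricksSqlOperator

# Hypothetical once this feature lands: Parquet output written straight to GCS
# in a single task, with no intermediate copy step.
export_to_gcs = DatabricksSqlOperator(
    task_id="export_daily_orders_to_gcs",
    databricks_conn_id="databricks_default",
    sql_endpoint_name="my_sql_warehouse",                        # assumed warehouse name
    sql="SELECT * FROM sales.daily_orders",                      # example query
    output_path="gs://my-bucket/exports/daily_orders.parquet",   # proposed: object storage URI
    output_format="parquet",                                     # proposed: new format value
)
```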

Proposed Change:

Introduce new output_format values: Add support for 'parquet' and 'avro' to the output_format parameter.

Enhance output_path to support object storage URIs: The output_path parameter should be able to accept object storage URIs (e.g., gs://my-bucket/path/to/data.parquet).

Implement the export logic: The operator's internal logic would need to be updated to convert the SQL query results to the specified format and write them directly to the GCS location, for example using the appropriate Databricks Spark APIs (e.g., spark.sql(...).write.format("parquet").save(...)). A worker-side alternative is sketched below.
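
One possible shape for the export step, sketched on the Airflow worker side rather than through Spark: the rows already fetched by the SQL connector are packed into a pyarrow table, written to a Parquet file, and uploaded with the existing GCS hook. The helper name, the temporary path, and the idea of reusing GCSHook are assumptions, not the operator's current API:

```python
from typing import Any, Sequence

import pyarrow as pa
import pyarrow.parquet as pq

from airflow.providers.google.cloud.hooks.gcs import GCSHook


def _write_parquet_to_gcs(
    column_names: Sequence[str],
    rows: Sequence[Sequence[Any]],
    gcs_uri: str,
    gcp_conn_id: str = "google_cloud_default",
) -> None:
    """Pack fetched rows into a pyarrow table, write Parquet, upload to GCS.

    Sketch only: assumes the operator keeps fetching results through the
    Databricks SQL connector and that the conversion runs on the worker.
    """
    # Split "gs://bucket/path/to/file.parquet" into bucket and object name.
    bucket, _, blob = gcs_uri.removeprefix("gs://").partition("/")

    # Transpose row tuples into columns and build an Arrow table.
    columns = list(zip(*rows)) if rows else [() for _ in column_names]
    table = pa.table({name: list(col) for name, col in zip(column_names, columns)})

    local_path = "/tmp/_databricks_sql_export.parquet"  # illustrative temp path
    pq.write_table(table, local_path)

    GCSHook(gcp_conn_id=gcp_conn_id).upload(
        bucket_name=bucket, object_name=blob, filename=local_path
    )
```

Avro output could follow the same pattern with an Avro writer in place of pyarrow.parquet.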

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct
