feat: add read_parquet, read_csv, read_json, read_avro SQL table functions #21367

Open
crm26 wants to merge 2 commits into apache:main from crm26:feat/read-parquet-udtf

crm26 commented Apr 4, 2026

Summary

Adds four inline SQL table functions for ad-hoc file querying:

SELECT * FROM read_parquet('/path/to/*.parquet')
SELECT * FROM read_csv('/data/file.csv')
SELECT * FROM read_json('/data/file.json')
SELECT * FROM read_avro('/data/file.avro')

Closes #3773

Design

Each function is a thin TableFunctionImpl wrapper (~60 lines) over ListingTable:

  1. Extract path string from Expr::Literal
  2. Construct ListingOptions with the format's FileFormat
  3. Infer schema via blocking bridge
  4. Return ListingTable as TableProvider
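
Step 1 and its error paths (no arguments, wrong argument type) can be sketched without pulling in DataFusion. The `Arg` enum and `extract_path` helper below are hypothetical stand-ins chosen for illustration; the PR itself matches on DataFusion's `Expr::Literal`, whose exact shape is not reproduced here.

```rust
/// Hypothetical stand-in for a SQL argument expression. DataFusion's real
/// `Expr` type is much richer; only the cases needed for the sketch exist.
#[derive(Debug, Clone, PartialEq)]
enum Arg {
    Utf8Literal(String),
    Int64Literal(i64),
    Column(String),
}

/// Returns the path from the first argument, or an error message naming the
/// function when the argument is missing or is not a string literal.
fn extract_path(func_name: &str, args: &[Arg]) -> Result<String, String> {
    match args.first() {
        Some(Arg::Utf8Literal(path)) => Ok(path.clone()),
        Some(other) => Err(format!(
            "{func_name} expects a string literal path, got {other:?}"
        )),
        None => Err(format!("{func_name} requires a path argument")),
    }
}
```

Rejecting non-literal arguments up front keeps the later steps simple: once a plain string is in hand, building the listing options and inferring the schema no longer depend on the SQL expression tree.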

Since the SQL planner wraps UDTF output as LogicalPlan::TableScan, all optimizer rules apply automatically:

  • Filter pushdown — verified via EXPLAIN test
  • Projection pushdown — verified via EXPLAIN test
  • Partition pruning — inherited from ListingTable

Async bridge

call_with_args is a sync fn, but infer_schema is async. The bridge uses std::thread::scope + Handle::block_on (not block_in_place, which panics on a current-thread runtime), so it works on both multi-thread and current-thread Tokio runtimes. Tested with a single-threaded runtime.
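
The shape of the bridge can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PR's code: the local `block_on` is a std-only stand-in for Tokio's `Handle::block_on`, and `infer_schema` / `infer_schema_blocking` are hypothetical functions that return column names instead of an Arrow schema.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Unparks the thread that is waiting on the future.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Std-only stand-in for `Handle::block_on`: polls the future on the
/// current thread and parks between polls.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            Poll::Pending => thread::park(),
        }
    }
}

/// Hypothetical async schema inference; returns fixed column names.
async fn infer_schema(path: &str) -> Result<Vec<String>, String> {
    if path.is_empty() {
        return Err("empty path".to_string());
    }
    Ok(vec!["id".to_string(), "name".to_string()])
}

/// Sync entry point: the future is driven on a scoped helper thread while
/// the calling thread only waits for that thread to finish.
fn infer_schema_blocking(path: &str) -> Result<Vec<String>, String> {
    thread::scope(|scope| {
        scope
            .spawn(|| block_on(infer_schema(path)))
            .join()
            .map_err(|_| "schema inference thread panicked".to_string())?
    })
}
```

A scoped thread lets the closure borrow `path` without cloning or requiring `'static` data, and the blocking call happens on a thread that is not itself inside the runtime's async context.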

Feature gating

  • read_parquet — requires parquet feature (default on)
  • read_avro — requires avro feature (default off)
  • read_csv / read_json — always available (no heavy optional dependencies)
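
As a rough sketch of how the gating plays out, the helper below lists which functions would be registered for a given build. The feature names `parquet` and `avro` come from this PR; `available_table_functions` is a hypothetical helper, and the PR may well gate whole items with `#[cfg(feature = "...")]` attributes rather than the `cfg!` macro used here.

```rust
/// Lists the table functions that would be registered for the Cargo
/// features enabled at compile time. Hypothetical helper for illustration.
fn available_table_functions() -> Vec<&'static str> {
    // The CSV and JSON readers are always compiled in.
    let mut names = vec!["read_csv", "read_json"];
    // `cfg!` evaluates to a compile-time boolean for the given feature.
    if cfg!(feature = "parquet") {
        names.push("read_parquet");
    }
    if cfg!(feature = "avro") {
        names.push("read_avro");
    }
    names
}
```

With neither feature enabled, only `read_csv` and `read_json` appear in the list.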

Limitations (v1)

  • Positional arguments only — no named args like has_header => true
  • No user-supplied schema override
  • No explicit Hive partition column specification
  • S3 paths require a registered object store

These can be addressed in follow-on PRs.

Tests

16 tests covering: basic read, filtered read, projection, aggregation, glob multi-file, error paths (no args, wrong type), filter/projection pushdown verification, and single-threaded runtime safety.

crm26 and others added 2 commits April 4, 2026 13:47
…tions

Built-in table functions that read files directly from SQL without
prior registration. Each wraps ListingTable with the appropriate
FileFormat, inheriting full optimizer support (filter pushdown,
projection pushdown, partition pruning, limit pushdown) automatically.

Parquet and Avro are feature-gated; CSV and JSON are always available.
Schema inference uses block_in_place to bridge the sync TableFunctionImpl
trait with async ListingOptions::infer_schema.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tions

Adds four inline SQL table functions that create ListingTable-backed
scans, enabling DuckDB-style ad-hoc file querying:

  SELECT * FROM read_parquet('/path/to/*.parquet')
  SELECT * FROM read_csv('/data/file.csv')
  SELECT * FROM read_json('/data/file.json')
  SELECT * FROM read_avro('/data/file.avro')

Each function is a thin wrapper (~60 lines) over ListingTable:
1. Extracts path from SQL literal argument
2. Constructs ListingOptions with the format's FileFormat
3. Infers schema via blocking bridge (std::thread::scope)
4. Returns ListingTable as TableProvider

Since the SQL planner wraps UDTF output as LogicalPlan::TableScan,
all optimizer rules apply automatically — filter pushdown, projection
pushdown, and partition pruning work out of the box.

Feature gating:
- read_parquet: requires `parquet` feature (default on)
- read_avro: requires `avro` feature (default off)
- read_csv/read_json: always available (no heavy deps)

Async bridge: uses std::thread::scope + Handle::block_on instead of
block_in_place, so it works on both multi-thread and current-thread
Tokio runtimes. Tested with single-threaded runtime.

16 tests covering: basic read, filtered read, projection, aggregation,
glob multi-file, error paths, filter/projection pushdown verification,
and single-threaded runtime safety.

Closes apache#3773
github-actions bot added the core (Core DataFusion crate) and functions (Changes to functions implementation) labels Apr 4, 2026

Development

Successfully merging this pull request may close these issues.

Add read_parquet SQL UDF
