feat: add read_parquet, read_csv, read_json, read_avro SQL table functions #21367
Open
crm26 wants to merge 2 commits into apache:main from
…tions
Built-in table functions that read files directly from SQL without prior registration. Each wraps ListingTable with the appropriate FileFormat, inheriting full optimizer support (filter pushdown, projection pushdown, partition pruning, limit pushdown) automatically.
Parquet and Avro are feature-gated; CSV and JSON are always available.
Schema inference uses block_in_place to bridge the sync TableFunctionImpl trait with async ListingOptions::infer_schema.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tions
Adds four inline SQL table functions that create ListingTable-backed
scans, enabling DuckDB-style ad-hoc file querying:
SELECT * FROM read_parquet('/path/to/*.parquet')
SELECT * FROM read_csv('/data/file.csv')
SELECT * FROM read_json('/data/file.json')
SELECT * FROM read_avro('/data/file.avro')
Each function is a thin wrapper (~60 lines) over ListingTable:
1. Extracts path from SQL literal argument
2. Constructs ListingOptions with the format's FileFormat
3. Infers schema via blocking bridge (std::thread::scope)
4. Returns ListingTable as TableProvider
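Step 1 and its error paths (no argument, non-string argument) can be sketched roughly as follows. This is a minimal illustration only: the `Expr` and `ScalarValue` enums here are hypothetical, cut-down stand-ins for DataFusion's types, and `extract_path` is an assumed helper name, not necessarily what the PR uses.

```rust
// Hypothetical stand-ins for DataFusion's `Expr` and `ScalarValue`; the real
// types have many more variants and a different layout.
#[derive(Debug)]
pub enum ScalarValue {
    Utf8(Option<String>),
    Int64(Option<i64>),
}

#[derive(Debug)]
pub enum Expr {
    Literal(ScalarValue),
    Column(String),
}

/// Returns the path held by the first argument, or a message describing why
/// the argument list is unusable. `extract_path` is an assumed helper name.
pub fn extract_path<'a>(fn_name: &str, args: &'a [Expr]) -> Result<&'a str, String> {
    match args.first() {
        // Error path: the function was called with no arguments.
        None => Err(format!("{fn_name} requires a file path argument")),
        // Happy path: a non-null string literal.
        Some(Expr::Literal(ScalarValue::Utf8(Some(path)))) => Ok(path.as_str()),
        // Error path: anything else (wrong literal type, column reference, NULL).
        Some(other) => Err(format!(
            "{fn_name} expects a string literal path, got {other:?}"
        )),
    }
}
```

Returning a borrowed `&str` avoids a copy; a real implementation would map the error into the engine's own error type instead of `String`.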
Since the SQL planner wraps UDTF output as LogicalPlan::TableScan,
all optimizer rules apply automatically — filter pushdown, projection
pushdown, and partition pruning work out of the box.
Feature gating:
- read_parquet: requires `parquet` feature (default on)
- read_avro: requires `avro` feature (default off)
- read_csv/read_json: always available (no heavy deps)
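The gating above might look something like the sketch below. The feature names `parquet` and `avro` come from the list; the function `available_table_functions` is a hypothetical helper for illustration, and the PR's actual registration code may be organized differently.

```rust
/// Lists the table functions compiled into this build.
/// Hypothetical helper, used here only to illustrate the gating.
pub fn available_table_functions() -> Vec<&'static str> {
    // CSV and JSON have no heavy optional dependencies, so they are always present.
    let mut names = vec!["read_csv", "read_json"];

    // `cfg!` evaluates to a compile-time boolean. Real gated code would more
    // likely put `#[cfg(feature = "...")]` on the module or item itself so the
    // format-specific code is not compiled at all when the feature is off.
    if cfg!(feature = "parquet") {
        names.push("read_parquet");
    }
    if cfg!(feature = "avro") {
        names.push("read_avro");
    }
    names
}
```

With neither feature enabled, only `read_csv` and `read_json` are reported.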
Async bridge: uses std::thread::scope + Handle::block_on instead of
block_in_place, so it works on both multi-thread and current-thread
Tokio runtimes. Tested with single-threaded runtime.
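The shape of this bridge can be illustrated with the standard library alone. Note the substitution: the sketch below replaces Tokio's `Handle::block_on` with a small hand-rolled `block_on`, so it only mirrors the scoped-thread structure and cannot drive Tokio IO or timers. `bridge` and `infer_schema_stub` are hypothetical names.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Waker that unparks the thread polling the future.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Stand-in for `Handle::block_on`: polls the future on the current thread,
/// parking between polls until the waker fires.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            Poll::Pending => thread::park(),
        }
    }
}

/// Runs an async computation from sync code on a scoped helper thread.
/// Because the thread is scoped, the future may borrow from the caller's stack.
pub fn bridge<F>(fut: F) -> F::Output
where
    F: Future + Send,
    F::Output: Send,
{
    thread::scope(|s| {
        s.spawn(move || block_on(fut))
            .join()
            .expect("bridge thread panicked")
    })
}

/// Hypothetical stand-in for async schema inference.
pub async fn infer_schema_stub(path: &str) -> Vec<String> {
    vec![format!("path:{path}")]
}
```

The calling thread blocks until the helper thread finishes, which is what lets a sync entry point hand back a fully resolved result.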
16 tests covering: basic read, filtered read, projection, aggregation,
glob multi-file, error paths, filter/projection pushdown verification,
and single-threaded runtime safety.
Closes apache#3773
Summary
Adds four inline SQL table functions for ad-hoc file querying: read_parquet, read_csv, read_json, and read_avro.
Closes #3773
Design
Each function is a thin `TableFunctionImpl` wrapper (~60 lines) over `ListingTable`:
1. Extracts the path from the `Expr::Literal` argument
2. Constructs `ListingOptions` with the format's `FileFormat`
3. Infers the schema via a blocking bridge
4. Returns the `ListingTable` as a `TableProvider`
Since the SQL planner wraps UDTF output as `LogicalPlan::TableScan`, all optimizer rules apply automatically: filter pushdown, projection pushdown, and partition pruning are inherited from `ListingTable`.
Async bridge
`call_with_args` is a sync fn but `infer_schema` is async. The bridge uses `std::thread::scope` + `Handle::block_on` (not `block_in_place`) so it works on both multi-thread and current-thread Tokio runtimes. Tested with single-threaded runtime.
Feature gating
- `read_parquet`: requires the `parquet` feature (default on)
- `read_avro`: requires the `avro` feature (default off)
- `read_csv` / `read_json`: always available (no heavy optional dependencies)
Limitations (v1)
has_header => true
These can be addressed in follow-on PRs.
Tests
16 tests covering: basic read, filtered read, projection, aggregation, glob multi-file, error paths (no args, wrong type), filter/projection pushdown verification, and single-threaded runtime safety.