Skip to content

fix: fold Limit/Sort into outer SELECT when Projection claims Aggregate through them#21375

Open
yonatan-sevenai wants to merge 9 commits intoapache:mainfrom
yonatan-sevenai:bug/unparse_over_aggregate
Open

fix: fold Limit/Sort into outer SELECT when Projection claims Aggregate through them#21375
yonatan-sevenai wants to merge 9 commits intoapache:mainfrom
yonatan-sevenai:bug/unparse_over_aggregate

Conversation

@yonatan-sevenai
Copy link
Copy Markdown

Which issue does this PR close?

Rationale for this change

When the SQL unparser encounters a Projection → Limit → Aggregate or Projection → Sort → Aggregate plan shape where aggregate aliases are inlined (no intermediate Projection between
Limit/Sort and Aggregate), it emits the aggregate expressions twice - once in the outer SELECT and once inside a spurious derived subquery. The outer SELECT also references columns that are
out of scope.

This plan shape doesn't occur from the SQL parser (which inserts an intermediate Projection), but optimizers and plan builders can produce it.

What changes are included in this PR?

datafusion/sql/src/unparser/plan.rs:

  • reconstruct_select_statement now returns Result - true when it found and claimed an Aggregate node for the current SELECT.
  • In the Projection arm of select_to_sql_recursively, after claiming an Aggregate, if the Projection's direct child is a Limit or Sort, fold its clauses (LIMIT/OFFSET or ORDER BY/LIMIT) into
    the current query and recurse into the child's input, skipping the Limit/Sort node. This prevents the Limit/Sort arm from seeing already_projected and wrapping everything in a spurious
    derived subquery.

datafusion/sql/tests/cases/plan_to_sql.rs:

  • roundtrip_aggregate_over_subquery — roundtrip test for the parser-generated plan shape (Projection between Limit and Aggregate), confirming it still works.
  • roundtrip_subquery_aggregate_with_column_alias - roundtrip test for SELECT id FROM (SELECT max(j1_id) FROM j1) AS c(id).
  • test_unparse_aggregate_over_subquery_no_inner_proj - manually constructed Projection → Limit → Aggregate plan, verifying the Limit is folded into the outer SELECT.
  • test_unparse_aggregate_no_outer_rename - manually constructed Projection → Aggregate with no outer rename, verifying aggregate aliases are preserved.
  • test_unparse_aggregate_with_sort_no_inner_proj - manually constructed Projection → Sort → Aggregate plan, verifying the Sort is folded into the outer SELECT.

Are these changes tested?

Yes. Five new tests covering:

  1. Parser-generated roundtrip (regression guard)
  2. Column alias roundtrip with subquery aggregate
  3. Limit folding with manually constructed plan (was the primary bug)
  4. Aggregate alias preservation without outer rename
  5. Sort folding with manually constructed plan (same bug pattern)

Are there any user-facing changes?

No API changes. The fix corrects SQL output for programmatically constructed logical plans that have Projection → Limit → Aggregate or Projection → Sort → Aggregate shapes without an intermediate Projection.

yonatan-sevenai and others added 9 commits March 22, 2026 00:06
…gregate

When the SQL unparser encountered a SubqueryAlias node whose direct
child was an Aggregate (or other clause-building plan like Window, Sort,
Limit, Union), it would flatten the subquery into a simple table alias,
losing the aggregate entirely.

For example, a plan representing:
  SELECT j1.col FROM j1 JOIN (SELECT max(id) AS m FROM j2) AS b ON j1.id = b.m

would unparse to:
  SELECT j1.col FROM j1 INNER JOIN j2 AS b ON j1.id = b.m

dropping the MAX aggregate and the subquery.

Root cause: the SubqueryAlias handler in select_to_sql_recursively would
call subquery_alias_inner_query_and_columns (which only unwraps
Projection children) and unparse_table_scan_pushdown (which only handles
TableScan/SubqueryAlias/Projection). When both returned nothing useful
for an Aggregate child, the code recursed directly into the Aggregate,
merging its GROUP BY into the outer SELECT instead of wrapping it in a
derived subquery.

The fix adds an early check: if the SubqueryAlias's direct child is a
plan type that builds its own SELECT clauses (Aggregate, Window, Sort,
Limit, Union), emit it as a derived subquery via self.derive() with the
alias always attached, rather than falling through to the recursive
path that would flatten it.
…gregate

When the SQL unparser encountered a SubqueryAlias node whose direct
child was an Aggregate (or other clause-building plan like Window, Sort,
Limit, Union), it would flatten the subquery into a simple table alias,
losing the aggregate entirely.

For example, a plan representing:
  SELECT j1.col FROM j1 JOIN (SELECT max(id) AS m FROM j2) AS b ON j1.id = b.m

would unparse to:
  SELECT j1.col FROM j1 INNER JOIN j2 AS b ON j1.id = b.m

dropping the MAX aggregate and the subquery.

Root cause: the SubqueryAlias handler in select_to_sql_recursively would
call subquery_alias_inner_query_and_columns (which only unwraps
Projection children) and unparse_table_scan_pushdown (which only handles
TableScan/SubqueryAlias/Projection). When both returned nothing useful
for an Aggregate child, the code recursed directly into the Aggregate,
merging its GROUP BY into the outer SELECT instead of wrapping it in a
derived subquery.

The fix adds an early check: if the SubqueryAlias's direct child is a
plan type that builds its own SELECT clauses (Aggregate, Window, Sort,
Limit, Union), emit it as a derived subquery via self.derive() with the
alias always attached, rather than falling through to the recursive
path that would flatten it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add roundtrip_aggregate_over_subquery and
test_unparse_aggregate_over_subquery_no_inner_proj to cover aggregate
expressions over subquery aliases. The latter demonstrates a bug where
the outer projection resolves aggregate expressions instead of column
references when the Projection between Limit and Aggregate is absent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds test_unparse_aggregate_no_outer_rename to verify behavior when the
outer Projection references aggregate columns without renaming them.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…te through them

When a Projection's `reconstruct_select_statement` reaches through a
Limit or Sort to claim an Aggregate, the Limit/Sort arm would later see
`already_projected` and wrap everything in a spurious derived subquery,
emitting the aggregate twice.

Fix: in the Projection arm, after claiming the Aggregate, detect if the
direct child is a Limit or Sort. If so, fold its clauses (LIMIT/OFFSET
or ORDER BY) into the current query and recurse into the Limit/Sort's
child, skipping the node entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the sql SQL Planner label Apr 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

sql SQL Planner

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unparser emits aggregate twice when Projection reaches through Limit/Sort to claim Aggregate

1 participant