feat: add scoped physical expression traversal#21386
feat: add scoped physical expression traversal#21386rluvaton wants to merge 17 commits intoapache:mainfrom
Conversation
9f1d7c6 to
c8d2648
Compare
|
@gstvg how would that affect your lambda support, would that make it easier or harder to implement column capture? note that in column capture you preferably should only keep the columns that are used to avoid a lot of copies |
| /// Get a list of child PhysicalExpr that provide the input for this expr. | ||
| fn children(&self) -> Vec<&Arc<dyn PhysicalExpr>>; | ||
|
|
||
| /// Get a list of child PhysicalExpr that provide the input for this expr that are in the same scope as this expression. |
There was a problem hiding this comment.
I think this API is confusing because "scope" isn't something that is defined
I think a beter description of what this API is doing is "children in the outermost scope"
All built in expressions have the (implicit) property that their children are evaluated in the same input schema. I don't see how CASE is any different in theory (though I understand it is what is causing you current problems)
There was a problem hiding this comment.
I also think if we add thsi API it is easy to misue and you'll be chasing down corner case bugs with other expression types that have the same problem
I believe it wouldn't make much difference, that's okay
Thanks, I will make sure to implement that |
Mostly tests
Which issue does this PR close?
Relates to #21231
Rationale for this change
CaseWhen(and other expression utilities) discover input columns by traversing the expression tree viachildren(). This breaks when a child expression introduces a new naming scope — for example, a lambda expression likearray_transform(list, item -> item + external_col)whereitemis bound to a different schema than the outer expression.Today, the only workaround is to not return scoped children from
children(), but that breaks other legitimate use cases (e.g. finding all expressions in a tree, wrapping every expression with debug/timing, etc.). We need a way to distinguish "same-scope children" from "all children".What changes are included in this PR?
New
ScopedTreeNodetrait (datafusion/common/src/tree_node.rs):TreeNodewith the concept of scopes — a subtree boundary that traversals should not cross unless intendedapply_in_scope,visit_in_scope,rewrite_in_scope,transform_down_in_scope,transform_up_in_scope,transform_down_up_in_scope,exists_in_scopeDynScopedTreeNodeforArc<dyn T>blanket implementationsNew methods on
PhysicalExprtrait:children_in_scope()— returns only children that share the same input schema as the current expression. Defaults tochildren(), so no breaking change for existing implementationswith_new_children_in_scope()— reconstructs the expression with new scoped children. Defaults towith_new_children()Updated expression utilities to use scoped traversal:
CaseWhencolumn projection now usesapply_in_scope/transform_down_in_scopeso it no longer traverses into lambda bodies or other new-scope childrencollect_columnsand related helpers inphysical-expr/src/utilsuse scoped traversal where appropriateNo changes to non-scoped traversals —
TreeNode::visit, display/formatting, and other tree walks that intentionally need the full tree continue to usechildren().Are these changes tested?
Yes:
ScopedTreeNodetraversal onTestTreeNodewith mixed scoped/non-scoped childrenCaseWhendoes not incorrectly project columns that belong to a different scope (e.g. lambda variables)Are there any user-facing changes?
Yes, but no breaking changes:
ScopedTreeNodeindatafusion-commonchildren_in_scope()andwith_new_children_in_scope()onPhysicalExpr