@viirya (Member) commented on Jan 16, 2026

Which issue does this PR close?

  • Closes #.

Rationale for this change

What changes are included in this PR?

This commit extends null-aware anti join functionality to support multiple columns, enabling queries like:

SELECT * FROM t1 WHERE (a, b) NOT IN (SELECT x, y FROM t2);

and correlated multi-column NOT IN subqueries:

SELECT * FROM t1 WHERE (c2, c3) NOT IN (
  SELECT c2, c3 FROM t2 WHERE t1.c1 = t2.c1
);

Changes:

Physical Execution Layer:

  • Remove single-column validation restriction in HashJoinExec
  • Extend NULL detection in probe phase to check ANY column for NULLs
  • Extend NULL filtering in final phase to filter rows with ANY NULL column
  • Add comprehensive unit tests for 2-column and 3-column joins
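
As a rough illustration of the probe-phase and final-phase rules listed above, here is a minimal, self-contained model of a multi-column null-aware anti join. It is a sketch only, not the HashJoinExec code: the names `Key`, `has_null`, and `null_aware_anti_join` are hypothetical, and the empty-subquery rule (keep every outer row) is an assumption based on standard NOT IN semantics.

```rust
/// A join key with one nullable value per key column (`None` models SQL NULL).
/// Hypothetical type for illustration; the real operator works on Arrow arrays.
type Key = Vec<Option<i64>>;

/// True if ANY column of the key is NULL.
fn has_null(key: &Key) -> bool {
    key.iter().any(|v| v.is_none())
}

/// Simplified model of a multi-column null-aware anti join.
/// `outer` holds the left-hand side of NOT IN, `subquery` the subquery rows.
/// Returns the indices of the outer rows that are kept.
fn null_aware_anti_join(outer: &[Key], subquery: &[Key]) -> Vec<usize> {
    // Assumption: an empty subquery keeps every outer row, NULLs included.
    if subquery.is_empty() {
        return (0..outer.len()).collect();
    }
    // "Probe phase": a NULL in ANY key column of any subquery row means
    // no outer row is emitted.
    if subquery.iter().any(has_null) {
        return Vec::new();
    }
    // "Final phase": drop outer rows with ANY NULL key column, then keep
    // only the rows that have no match in the subquery.
    outer
        .iter()
        .enumerate()
        .filter(|&(_, key)| !has_null(key) && !subquery.contains(key))
        .map(|(i, _)| i)
        .collect()
}
```

With outer rows `(1, 2)`, `(3, 4)`, `(5, NULL)` and a subquery containing only `(1, 2)`, this model keeps just the second row; adding a subquery row such as `(1, NULL)` makes the result empty.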

SQL Planning Layer:

  • Allow tuple expressions in parse_in_subquery()
  • Add validation for tuple field count matching
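
The field-count validation could look roughly like the sketch below. This is an assumption about the shape of the check, not the code added to parse_in_subquery(); `check_tuple_arity` and its error text are hypothetical.

```rust
/// Hypothetical check: the tuple on the left of IN / NOT IN must have as many
/// fields as the subquery has output columns.
fn check_tuple_arity(tuple_fields: usize, subquery_columns: usize) -> Result<(), String> {
    if tuple_fields == subquery_columns {
        Ok(())
    } else {
        // Illustrative message only; the planner's real error text may differ.
        Err(format!(
            "tuple has {tuple_fields} field(s) but the subquery returns {subquery_columns} column(s)"
        ))
    }
}
```

For example, `(a, b) NOT IN (SELECT x, y FROM t2)` passes with 2 and 2, while `(a, b) NOT IN (SELECT x, y, z FROM t2)` would be rejected.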

Query Optimization Layer:

  • Update InSubquery validation to allow struct expressions
  • Skip type coercion for struct expressions (handled in decorrelation)
  • Implement struct decomposition in decorrelate_predicate_subquery
  • Decompose struct(a, b) into individual join conditions a = x AND b = y
  • Handle both correlated and non-correlated multi-column subqueries
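
The struct decomposition step can be pictured with the sketch below. It uses a tiny stand-in `Expr` enum rather than DataFusion's own expression type, and `decompose_struct` and `render` are hypothetical helpers; the actual logic in decorrelate_predicate_subquery may be organized differently.

```rust
/// Minimal expression type for illustration only (not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Eq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

/// Pair each struct field with the subquery column at the same position and
/// AND the equalities together. Returns `None` if the arities differ or if
/// there are no fields at all.
fn decompose_struct(fields: &[&str], subquery_cols: &[&str]) -> Option<Expr> {
    if fields.len() != subquery_cols.len() {
        return None;
    }
    fields
        .iter()
        .zip(subquery_cols)
        .map(|(l, r)| {
            Expr::Eq(
                Box::new(Expr::Column((*l).to_string())),
                Box::new(Expr::Column((*r).to_string())),
            )
        })
        // Fold the per-column equalities into a left-deep AND chain.
        .reduce(|acc, eq| Expr::And(Box::new(acc), Box::new(eq)))
}

/// Render an expression as SQL-like text, for inspection.
fn render(expr: &Expr) -> String {
    match expr {
        Expr::Column(name) => name.clone(),
        Expr::Eq(l, r) => format!("{} = {}", render(l), render(r)),
        Expr::And(l, r) => format!("{} AND {}", render(l), render(r)),
    }
}
```

Calling `decompose_struct(&["a", "b"], &["x", "y"])` and rendering the result yields `a = x AND b = y`, the join condition named in the list above.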

Test Coverage:

  • Add 7 new SQL logic test cases (Tests 19-25)
  • Add 3 unit test functions with 15 test variants (5 batch sizes each)
  • Cover 2-column, 3-column, empty subquery, and NULL patterns
  • Include correlated multi-column NOT IN from issue #10583 (DataFusion HashJoin LeftAnti doesn't support null aware anti join)
  • Add test coverage for multi-column IN subqueries to verify that the
    struct expression support works correctly for both negated (NOT IN)
    and non-negated (IN) cases.

Are these changes tested?

Are there any user-facing changes?

viirya and others added 2 commits January 16, 2026 10:04

Test Results:
- 31/31 null-aware anti join tests passing
- 369/369 total hash join tests passing
- All optimizer tests passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Add test coverage for multi-column IN subqueries to verify that the
struct expression support works correctly for both negated (NOT IN)
and non-negated (IN) cases.

Tests added to subquery.slt:
- Test 1: Basic two-column IN
- Test 2: Multi-column IN with no matches
- Test 3: Multi-column IN with NULL values (verifies non-null-aware behavior)
- Test 4: Three-column IN
- Test 5: Correlated multi-column IN
- Test 6: Verify logical plan shows LeftSemi with multiple join conditions
- Test 7: Multi-column IN with empty subquery
- Test 8: Multi-column IN with WHERE clause in subquery

These tests complement the multi-column NOT IN tests in
null_aware_anti_join.slt and verify that struct decomposition
(converting `(a, b) IN (SELECT x, y ...)` into `a = x AND b = y`)
works correctly for LeftSemi joins.

Key differences from NOT IN:
- IN uses LeftSemi join (not null-aware)
- IN does not use CollectLeft partition mode
- NULL values don't match in regular semi joins (two-valued logic)

Related to multi-column null-aware anti join implementation.
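
To make the "Key differences from NOT IN" list concrete, the sketch below models a plain left semi join in which a NULL key column never matches. It is an illustrative model only, with hypothetical names (`Row`, `keys_match`, `left_semi_join`), not the operator's implementation.

```rust
/// A row of nullable key columns (`None` models SQL NULL).
type Row = Vec<Option<i64>>;

/// Column-wise equality where NULL never equals anything, NULL included.
fn keys_match(left: &Row, right: &Row) -> bool {
    left.len() == right.len()
        && left
            .iter()
            .zip(right)
            .all(|(l, r)| matches!((l, r), (Some(a), Some(b)) if a == b))
}

/// Plain (not null-aware) left semi join: keep the outer rows that have at
/// least one matching subquery row. Returns the indices of the kept rows.
fn left_semi_join(outer: &[Row], subquery: &[Row]) -> Vec<usize> {
    outer
        .iter()
        .enumerate()
        .filter(|&(_, row)| subquery.iter().any(|s| keys_match(row, s)))
        .map(|(i, _)| i)
        .collect()
}
```

Unlike the null-aware anti join, a NULL on either side here simply fails to match: with outer rows `(1, 2)`, `(1, NULL)`, `(3, 4)` and subquery rows `(1, 2)`, `(1, NULL)`, only the first outer row is kept.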
@viirya changed the title from "Multi column null aware anti join" to "feat: Add multi-column support for null-aware anti joins" on Jan 16, 2026
@github-actions bot added the following labels on Jan 16, 2026: sql (SQL Planner), logical-expr (Logical plan and expressions), optimizer (Optimizer rules), sqllogictest (SQL Logic Tests (.slt)), physical-plan (Changes to the physical-plan crate)
@viirya force-pushed the multi-column-null-aware-anti-join branch from 8f96467 to f61eabb on January 16, 2026 20:04
- Collapse nested if statement in invariants.rs (clippy::collapsible_if)
- Collapse nested if statement in hash_join/exec.rs (clippy::collapsible_if)
- Use unwrap_or_else instead of unwrap_or for function calls in
  decorrelate_predicate_subquery.rs (clippy::or_fun_call)
@viirya force-pushed the multi-column-null-aware-anti-join branch from f61eabb to f6db769 on January 16, 2026 22:25
